There has been extensive work in many different fields on how phenomena of interest (e.g. diseases, innovation,
product adoption) “diffuse” through a social network. As social networks increasingly become a fabric
of society, there is a need to make “optimal” decisions with respect to an observed model of diffusion. For
example, in epidemiology, officials want to find a set of k individuals in a social network which, if treated,
would minimize spread of a disease. In marketing, campaign managers try to identify a set of k customers
that, if given a free sample, would generate maximal “buzz” about the product. In this paper, we first show
that the well-known Generalized Annotated Program (GAP) paradigm can be used to express many existing
diffusion models. We then define a class of problems called Social Network Diffusion Optimization Problems
(SNDOPs). SNDOPs have four parts: (i) a diffusion model expressed as a GAP, (ii) an objective function we
want to optimize with respect to a given diffusion model, (iii) an integer k > 0 describing resources (e.g.
medication) that can be placed at nodes, (iv) a logical condition VC that governs which nodes can have a resource
(e.g. only children above the age of 5 can be treated with a given medication). We study the computational
complexity of SNDOPs and show NP-completeness results as well as results on the complexity of
approximation. We then develop an exact and a heuristic algorithm to solve a large class of SNDOPs and
show that our GREEDY-SNDOP algorithm achieves the best possible approximation ratio that a polynomial
algorithm can achieve (unless P = NP). We conclude with a prototype experimental implementation to
solve SNDOPs that looks at a real-world Wikipedia data set consisting of over 103,000 edges.
1. INTRODUCTION

There is a rapid proliferation of different types of graph data in the world today.
These include social network data (Facebook, Flickr, YouTube, etc.), cell phone network
data [N. Eagle and Lazer 2008] collected by virtually all cell phone vendors, email
network data (such as those derived from the Enron corpus$^1$), as well as information
on disease networks [Coelho et al. 2008; Anderson and May 1979]. In addition, the
World Wide Web Consortium’s RDF standard is a graph-based standard for encoding
semantic information contained in web pages. There have been years of work on analyzing
how various properties of nodes in such networks “diffuse” through the network;
different techniques have been invented in different academic disciplines including
economics [Jackson and Yariv 2005; Schelling 1978], infectious diseases [Coelho et al.
2008], sociology [Granovetter 1978] and computer science [Kempe et al. 2003].

Past work on diffusion has several limitations. (i) First, it largely assumes that
a social network is nothing but a set of vertices and edges [Watts 1999; Cowan and
Jonard 2004; Rychtar and Stadler 2008]. In contrast, in this paper we adopt a richer
model where edges and vertices can both be labeled with properties. For instance, a political
campaigner hoping to spread a positive message about a campaign needs to use
demographics (e.g. sex, age group, educational level, group affiliations, etc.) for targeting
a political message; a “one size fits all” message will not work. In general, social
network researchers would say that they have several sociomatrices that can be used
for such applications. (ii) Second, past work on diffusion has no notion of “strength”
associated with edges. It may well be the case, in many applications, that the degree
of contact between two vertices (e.g. number of minutes person A spends on the cell
phone with person B) is a proxy for the strength of the relationship between A and
B, which in turn may have an impact on whether A can influence B or not. (iii) Third,
these past frameworks [Jackson and Yariv 2005; Schelling 1978; Coelho et al. 2008;
Granovetter 1978] usually reason about a single diffusion model, rather than develop
a framework for reasoning about a whole class of diffusion models.

Past diffusion models developed in a variety of fields ranging from business [Jackson
and Yariv 2005], economics [Schelling 1978], social science [Granovetter 1978] and
epidemiology [Coelho et al. 2008; Hethcote 1976; Anderson and May 1979] to mobile phone
usage [Aral et al. 2009] show that diffusion models vary dramatically from application
to application. Three broad categories of diffusion models exist.

(1) Cascade models [Coelho et al. 2008; Hethcote 1976; Anderson and May 1979] are
widespread in epidemiology and assume that diffusions are largely based on connectivity
between nodes and are largely probabilistic.

(2) Tipping models do not use probabilities, but use various quantitative calculations
to determine when a vertex adopts (or is infected with) a diffusive property. They
are omnipresent in the social sciences and business [Centola 2010; Jackson and
Yariv 2005; Granovetter 1978]. Nobel laureate Tom Schelling makes a similar
point: diffusions in many social science applications have a tipping point at which
vertices become influenced by the number of neighbors and the strength of commitment
the neighbors may have to a certain position. No probabilities are present
in such models.

(3) Homophilic models are ones where similarity between users, rather than network
effects, dominates diffusion. Similarity is usually calculated using some quantitative
model, often related to distance between vectors representing (values of) properties
of nodes. For example, [Aral et al. 2009] tracks adoption of mobile applications
in a study of over 27M users and shows that homophily (similarity between
users) is the most compelling diffusion model. There are no probabilities here,
just similarity measures. Another well-known diffusion model focused on marketing
[Watts and Peretti 2007] is also based on homophily and similarity of nodes’
intrinsic properties rather than on a probability.


1 http://www.cs.cmu.edu/~enron/
Moreover, many models use a mix of the above forms. For instance, [Cha et al. 2009]
argues that the way photos are marked as “favorites” on Flickr is based on a mix of
cascading and homophilic behavior and that, to study the former, one must also account
for the latter. A similar combination of cascading and tipping is observed in [Zhang
2011]. Another strong indication of hybrid models in real social networks is the noteworthy
experimental study of [Centola 2011], which illustrates how a tipping model
combined with homophilic effects promotes diffusion of health behaviors in an online
network. Thus, any general framework for expressing diffusions must have the capability
to express all three types of diffusion models, not just one or the other. In general, a
language to express diffusion models must be capable of expressing a wide variety of
quantitative methods encapsulated in the above.

In  this  paper,  we  first  show  that  a  class  of  the  well-known  Generalized  Annotated 
Program  (GAP)  paradigm  [Kifer  and  Subrahmanian  1992;  Kifer  and  Lozinskii  1992; 
Thirunarayan  and  Kifer  1993]  and  their  variants  [Vennekens  et  al.  2004;  Krajci  et  al. 
2004;  Lu  1996;  Lu  et  al.  1993;  Damasio  et  al.  1999]  including  Linear  GAPs  (introduced 
here)  form  a  convenient  method  to  express  many  diffusion  models.  Though  there  is 
no  claim  that  they  can  express  all  possible  useful  diffusion  models,  they  do  express 
all  diffusion  models  (over  30)  we  have  studied  in  the  literature  on  a  wide  variety  of 
topics.  Moreover,  [Broecheler  et  al.  2010]  provides  an  algorithm  to  automatically  learn 
such  diffusion  models  from  historical  data,  so  users  do  not  need  to  write  their  diffusion 
models  by  themselves.  This  provides  greater  confidence  that  these  diffusion  models 
are  “correct.”  Many  other  papers  also  focus  on  learning  diffusion  models  automatically 
for  different  types  of  applications  —  [Leskovec  et  al.  2007a]  develop  a  probabilistic 
learning  algorithm,  while  [Backstrom  et  al.  2006]  develop  a  method  that  takes  both  the 
properties  of  vertices  and  the  strength  of  relationships  between  vertices  to  learn  such 
a  diffusion  model  automatically.  We  expect  that  in  most  real-world  applications  going 
forward,  diffusion  models  will  be  automatically  learned  rather  than  being  programmed 
by  logic  programmers. 

Next, unlike most existing work in social networks, which focuses on learning diffusion
models, we focus on reasoning with diffusion models (expressed via GAPs) after the
diffusion models have been learned. In particular, we consider the problem of optimal
decision making in social networks which have associated diffusion models expressible
as Linear GAPs, though many of the results in the paper apply to arbitrary GAPs as
well. Here are two examples.

- (Q1) Cell phone plans. A cell phone company is promoting a new cell phone plan;
as a promotion, it is giving away k free plans to existing customers.$^2$ Which set of
k people should they pick so as to maximize the number of plan adoptees predicted
by a cell phone plan adoption diffusion model they have learned from their past
promotions?

- (Q2) Medication distribution plan. A government combating a disease spread
by physical contact has limited stocks of free medication to give away. Based on a
diffusion model of how the disease spreads (e.g. kids might be more susceptible than
adults, those previously inoculated against the disease are safe, etc.), they want
to find a set of k people who (jointly) maximally spread the disease when infected
(so that they can provide immediate treatment to these k people in an attempt to
halt the disease’s spread).$^3$ Notice that this query corresponds to only one of many
different policies that can be considered to deal with the disease spread scenario;
that is, we consider the case where a diffusion model expressing how an infected
person can infect other people is available and formulate a query that looks at the
maximum spread when k people are infected. Other queries, possibly leading to
different answers about who should be treated with medications, are possible.


2 Our framework allows us to add additional constraints, for instance, that plans can only be given to
customers satisfying certain conditions, e.g. customers deemed to be “good” according to various business
criteria.

Both these problems are instances of a class of queries that we call Social Network
Diffusion Optimization Problem (SNDOP) queries. They differ from queries studied
in the past in quantitative (both probabilistic and annotated) logic programming in
two fundamental ways: (i) They are specialized to operate on graph data where the
graph’s vertices and edges are labeled with properties and where the edges can have
associated weights. (ii) They find sets of vertices that optimize complex objective functions
that can be specified by the user. Neither of these has been studied before by any
kind of quantitative logic programming framework, though work on optimizing objective
functions in the context of different types of semantics (minimal model and stable
model semantics) has been studied before [Leone et al. 2004]. Constraint
logic programming [Apt 2003] has also extensively studied optimization issues
in logic programming; however, there, optimization and constraint solving are embedded
in the constraint logic program, whereas in our case, they are part of the query over an
annotated logic program. Moreover, most measures of importance in social networks
are centrality measures that study the influence of single vertices; [Borgatti and Everett
2006] provides an excellent overview of centrality measures. In contrast, a set
of k nodes each with low individual centrality may often wield greater influence on
a network than the set consisting of the k nodes with highest individual centrality.
Intuitively, this is due to the fact that the k nodes with highest individual centrality
may overlap greatly in the nodes they influence, leading to an aggregate number of
influenced nodes that is lower than the one in the first case.
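The overlap argument can be made concrete with a small sketch. The `reach` sets below are hypothetical (they are not data from this paper): each maps a seed node to the nodes it would influence on its own, and set size stands in for an individual centrality score.

```python
# Hypothetical reach sets: nodes each seed would influence by itself.
reach = {
    "hub1": {1, 2, 3, 4, 5},       # highest individual reach
    "hub2": {1, 2, 3, 4, 6},       # also high, but overlaps heavily with hub1
    "side1": {7, 8, 9, 10},        # lower individual reach
    "side2": {11, 12, 13, 14},     # lower individual reach, no overlap
}

def covered(seeds):
    """Nodes influenced by at least one seed (the aggregate influence)."""
    return set().union(*(reach[s] for s in seeds))

# The two seeds with the highest individual reach ...
top_two = sorted(reach, key=lambda s: len(reach[s]), reverse=True)[:2]
# ... versus two seeds that are individually weaker but do not overlap.
low_two = ["side1", "side2"]

print(top_two, len(covered(top_two)))
print(low_two, len(covered(low_two)))
```

In this toy instance the two individually strongest seeds jointly cover fewer nodes than the two weaker ones, which is exactly why a set-level objective is needed.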

This paper is organized as follows. In Section 2, we provide an overview of GAPs
(past work), define a social network (SN for short), and explain how GAPs can represent
some types of diffusion in SNs. Section 3 formally defines different types of social
network diffusion optimization problems and provides results on their computational
complexity and other properties. Section 4 shows how our framework can represent
several existing diffusion models for social networks, including models from economics
and epidemiology. In Section 5 we present the exact SNDOP-Mon algorithm to answer SNDOP
queries under certain assumptions of monotonicity. We then develop a greedy algorithm
GREEDY-SNDOP and show that under certain conditions, it is guaranteed to be
an $\left(\frac{e}{e-1}\right)$ approximation algorithm for SNDOP queries; this is the best possible
approximation guarantee. Last, but not least, we describe our prototype implementation
and experiments in Section 6. Specifically, we tested our GREEDY-SNDOP algorithm
on a real-world social network data set consisting of over 7000 nodes and over 103,000
edges from Wikipedia logs. We show that we solve social network diffusion optimization
problems over real data sets in acceptable times. We emphasize that much additional
work is required on further enhancing scalability and that research on social
network diffusion optimization problems is in its infancy. Finally, in Section 7, we
review related work.


3 Again, our framework allows us to add additional constraints, for instance, that medication can only be
given to people satisfying certain conditions, e.g. be over a certain age, or be within a certain age range and
not have any conditions that are contraindications for the medication in question.
2. TECHNICAL PRELIMINARIES

In this section, we first formalize social networks, then briefly review generalized annotated
logic programs (GAPs) [Kifer and Subrahmanian 1992] and then describe how
GAPs can be used to represent concepts related to diffusion in SNs.

2.1.  Social  Networks  Formalized 

Throughout  this  paper,  we  assume  the  existence  of  two  arbitrary  but  fixed  disjoint  sets 
VP,  EP  of  vertex  and  edge  predicate  symbols  respectively.  Each  vertex  predicate  symbol 
has  arity  1  and  each  edge  predicate  symbol  has  arity  2. 

Definition 2.1. A social network is a 5-tuple $(V, E, \ell_{vert}, \ell_{edge}, w)$ where:

(1) $V$ is a finite set whose elements are called vertices.

(2) $E \subseteq V \times V$ is a finite multi-set whose elements are called edges.

(3) $\ell_{vert} : V \to 2^{VP}$ is a function, called vertex labeling function.

(4) $\ell_{edge} : E \to EP$ is a function, called edge labeling function.$^4$

(5) $w : E \to [0, 1]$ is a function, called weight function.

We  now  present  a  brief  example  of  an  SN. 

Example 2.2. Let us return to the cell phone example (query (Q1)). Figure 1 shows
a toy SN the cell phone company might use. Here, we might have VP = {male, female,
adopter, temp_adopter, non_adopter} denoting the sex and past adoption behavior of each
vertex; EP might be the set {phone, email, IM} denoting the types of interactions between
vertices (phone call, email, and instant messaging respectively). The function
$\ell_{vert}$ is shown in Figure 1 by the shape (denoting past adoption status) and shading
(male/female). The type of edges (bold for phone, dashed for email, dotted for IM)
is used to depict $\ell_{edge}$. $w((v_1, v_2))$ denotes the percentage of communications of type
$\ell_{edge}((v_1, v_2))$ initiated by $v_1$ that were with $v_2$ (measured either w.r.t. time or bytes).
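One possible in-memory encoding of Definition 2.1 is sketched below. The class and field names (`Edge`, `SocialNetwork`, `validate`) are illustrative assumptions, and the three-vertex network is hypothetical; it does not reproduce Figure 1.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str       # initiating vertex
    dst: str       # receiving vertex
    label: str     # exactly one edge predicate symbol from EP
    weight: float  # w(e), must lie in [0, 1]

@dataclass
class SocialNetwork:
    vertices: set[str]
    vertex_labels: dict[str, set[str]]  # l_vert : V -> 2^VP
    edges: list[Edge]                   # a list, so E may be a multiset

    def validate(self, VP: set[str], EP: set[str]) -> None:
        """Raise ValueError if the data violates Definition 2.1."""
        for v, labels in self.vertex_labels.items():
            if v not in self.vertices or not labels <= VP:
                raise ValueError(f"bad vertex labelling for {v!r}")
        for e in self.edges:
            if e.src not in self.vertices or e.dst not in self.vertices:
                raise ValueError(f"edge {e} has an endpoint outside V")
            if e.label not in EP or not 0.0 <= e.weight <= 1.0:
                raise ValueError(f"bad label or weight on {e}")

VP = {"male", "female", "adopter", "temp_adopter", "non_adopter"}
EP = {"phone", "email", "IM"}

# Hypothetical toy network; values are NOT taken from Figure 1.
sn = SocialNetwork(
    vertices={"v1", "v2", "v3"},
    vertex_labels={"v1": {"male", "adopter"},
                   "v2": {"female", "adopter"},
                   "v3": {"male", "non_adopter"}},
    edges=[Edge("v1", "v2", "IM", 0.4),
           Edge("v1", "v2", "phone", 0.9),  # parallel edge, different label
           Edge("v3", "v1", "email", 1.0)],
)
sn.validate(VP, EP)
```

Storing edges in a list rather than a set is a deliberate choice here: it lets two vertices be connected by several edges with different labels, as the definition allows.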

It is important to note that our definition of social networks is much broader than
that used by several researchers [Anderson and May 1979; Coelho et al. 2008; Jackson
and Yariv 2005; Kempe et al. 2003] who often do not consider either $\ell_{edge}$ or $\ell_{vert}$ or
edge weights through the function $w$; it is well-known in marketing that intrinsic
properties of vertices (customers, patients) and the nature and strength of the relationships
(edges) are critical for decision making in those fields.

Note. We assume that SNs satisfy various integrity constraints. In Example 2.2, it is
clear that $\ell_{vert}(v)$ should include at most one of male, female and at most one of adopter,
temp_adopter, non_adopter. We assume the existence of some integrity constraints (ICs) to ensure
this kind of semantic integrity; they can be written in any reasonable syntax
for expressing ICs. In the rest of this paper, we assume that social networks have associated
ICs and that they satisfy them. In our example, we will assume ICs ensuring
that a vertex can be marked with at most one of male, female and at most one of
adopter, temp_adopter, non_adopter.

2.2.  Generalized  Annotated  Programs:  A  Recap 

We now recapitulate the definition of generalized annotated logic programs from [Kifer
and Subrahmanian 1992]. We assume the existence of a set AVar of variable symbols
ranging over the unit real interval [0, 1] and a set $\mathcal{F}$ of function symbols each of which
has an associated arity. We start by defining annotations.


4 Each edge $e \in E$ is labeled by exactly one predicate symbol from EP. However, there can be multiple edges
between two vertices labeled with different predicate symbols.

Fig. 1. Example cellular network.

Definition 2.3 (Annotation). Annotations are inductively defined as follows:

(i) Any member of $[0, 1] \cup AVar$ is an annotation.

(ii) If $f \in \mathcal{F}$ is an n-ary function symbol and $t_1, \ldots, t_n$ are annotations, then $f(t_1, \ldots, t_n)$
is an annotation.

For instance, 0.5, 1, 0.3 and X are all annotations (here X is assumed to be a variable
in AVar). If +, *, / are all binary function symbols in $\mathcal{F}$, then terms built from them over
annotations, such as $\frac{X + 0.3}{0.5} * 0.5$, are also annotations.$^5$

We define a separate logical language whose constants are members of $V$ and whose
predicate symbols consist of $VP \cup EP$. We also assume the existence of a set $\mathcal{V}$ of variable
symbols ranging over the constants (vertices). No function symbols are present.
Terms and atoms are defined in the usual way (cf. [Lloyd 1987]). If $A = p(t_1, \ldots, t_n)$ is
an atom and $p \in VP$ (resp. $p \in EP$), then $A$ is called a vertex (resp. edge) atom. We will
use $\mathcal{A}$ to denote the set of all ground atoms (i.e., atoms where no variable occurs).

Definition 2.4 (annotated atom / GAP-rule / GAP). If $A$ is an atom and $\mu$ is an annotation,
then $A : \mu$ is an annotated atom. If $A$ is a vertex (resp. edge) atom, then $A : \mu$
is also called vertex (resp. edge) annotated atom. If $A_0 : \mu_0, A_1 : \mu_1, \ldots, A_n : \mu_n$ are
annotated atoms, then

$A_0 : \mu_0 \leftarrow A_1 : \mu_1 \wedge \ldots \wedge A_n : \mu_n$

is called a GAP rule (or simply rule). When n = 0, the above rule is called a fact.$^6$ A
generalized annotated program (GAP) is a finite set of rules. An annotated atom (resp.
a rule, a GAP) is ground iff there are no occurrences of variables from either AVar or $\mathcal{V}$
in it.


5 Notice that in [Kifer and Subrahmanian 1992] annotations are not restricted to be in [0, 1] but any upper
semi-lattice is allowed; for the purpose of this paper we will restrict ourselves to the unit real interval.
6 For notational simplicity, we will often write a fact $A_0 : \mu_0 \leftarrow$ simply as $A_0 : \mu_0$, i.e. we drop the symbol $\leftarrow$.
Every social network $S = (V, E, \ell_{vert}, \ell_{edge}, w)$ can be represented by the GAP

$\Pi_S = \{ q(v) : 1 \leftarrow \;\mid\; v \in V \wedge q \in \ell_{vert}(v) \} \cup \{ ep(v_1, v_2) : w((v_1, v_2)) \leftarrow \;\mid\; (v_1, v_2) \in E \wedge \ell_{edge}((v_1, v_2)) = ep \}$.
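The construction of $\Pi_S$ is mechanical, as the following sketch suggests. The representation of a fact as a `(predicate, arguments, annotation)` triple and the function name `embed` are assumptions made for illustration only.

```python
def embed(vertex_labels, edges):
    """Return Pi_S as a set of facts (predicate, args, annotation)."""
    facts = set()
    for v, labels in vertex_labels.items():
        for q in labels:
            facts.add((q, (v,), 1.0))      # q(v) : 1 <-
    for v1, v2, ep, w in edges:
        facts.add((ep, (v1, v2), w))       # ep(v1, v2) : w((v1, v2)) <-
    return facts

# Hypothetical two-vertex network.
vertex_labels = {"v1": {"male", "adopter"}, "v2": {"female"}}
edges = [("v1", "v2", "IM", 0.4)]          # (source, target, label, weight)

pi_s = embed(vertex_labels, edges)
for fact in sorted(pi_s):
    print(fact)
```

Every vertex label becomes a fact annotated with 1 and every edge becomes a fact annotated with its weight, so no information about the network is lost in the embedding.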

Definition 2.5 (embedded social network). A social network $S$ is said to be embedded
in a GAP $\Pi$ iff $\Pi_S \subseteq \Pi$.

It is clear that all social networks can be represented as GAPs. When we augment $\Pi_S$
with other rules, such as rules describing how certain properties diffuse through the
social network, we get a GAP $\Pi \supseteq \Pi_S$ that captures both the structure of the SN and
the diffusion principles. Here is a small example of such a GAP.

Example 2.6. The GAP $\Pi_{cell}$ might consist of $\Pi_S$ using the social network of Figure 1
plus the GAP-rules:

(1) $will\_adopt(V_0) : 0.8 \times X + 0.2 \leftarrow adopter(V_0) : 1 \wedge male(V_0) : 1 \wedge IM(V_0, V_1) : 0.3 \wedge female(V_1) : 1 \wedge will\_adopt(V_1) : X$.

(2) $will\_adopt(V_0) : 0.9 \times X + 0.1 \leftarrow adopter(V_0) : 1 \wedge male(V_0) : 1 \wedge IM(V_0, V_1) : 0.3 \wedge male(V_1) : 1 \wedge will\_adopt(V_1) : X$.

(3) $will\_adopt(V_0) : 1 \leftarrow temp\_adopter(V_0) : 1 \wedge male(V_0) : 1 \wedge email(V_1, V_0) : 1 \wedge female(V_1) : 1 \wedge will\_adopt(V_1) : 1$.

Rule (1) says that if $V_0$ is a male adopter and $V_1$ is female and the weight of $V_0$’s
instant messages to $V_1$ is 0.3 or more, and we previously thought that $V_1$ would be an
adopter with confidence $X$, then we can infer that $V_0$ will adopt the new plan with
confidence $0.8 \times X + 0.2$. The other rules may be similarly read.
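As a worked instance, assume (purely for illustration) that all non-variable body atoms of rules (1) and (2) hold and that $V_1$ is believed to adopt with confidence $X = 0.5$. The head annotations then evaluate to

$$0.8 \times 0.5 + 0.2 = 0.6 \qquad \text{(rule (1))}$$

$$0.9 \times 0.5 + 0.1 = 0.55 \qquad \text{(rule (2))}$$

Since a body atom $will\_adopt(V_1) : X$ with $X = 0$ is satisfied trivially, rule (1) never yields a confidence below $0.2$ whenever its remaining body atoms hold, and it yields $1$ when $X = 1$.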

Suppose $S$ is a social network and $\Pi \supseteq \Pi_S$ is a GAP. In this case, we call the rules
in $\Pi - \Pi_S$ diffusion rules. In this paper we consider a restricted class of GAPs: every
rule with a non-empty body has a vertex annotated atom in the head ([Kifer and Subrahmanian
1992] allows any atom to appear in the head of a rule). Thus, edge atoms
can appear only in rule bodies or facts. This means that neither edge weights nor edge
labels change as the result of the diffusion. However, for the general case, it is possible
for them to change as a result of the diffusion process.

GAPs have a formal semantics that can be immediately used. An interpretation $I$
is any mapping from the set $\mathcal{A}$ of all ground atoms to [0, 1]. The set $\mathcal{I}$ of all interpretations
can be partially ordered via the ordering: $I_1 \preceq I_2$ iff for all ground atoms $A$,
$I_1(A) \le I_2(A)$. $\mathcal{I}$ forms a complete lattice under the $\preceq$ ordering.

Definition 2.7 (satisfaction / entailment). An interpretation $I$ satisfies a ground annotated
atom $A : \mu$, denoted $I \models A : \mu$, iff $I(A) \ge \mu$. $I$ satisfies a ground GAP-rule $r$
of the form $AA_0 \leftarrow AA_1 \wedge \ldots \wedge AA_n$ (denoted $I \models r$) iff either (i) $I$ satisfies $AA_0$ or
(ii) there exists an $1 \le i \le n$ such that $I$ does not satisfy $AA_i$. $I$ satisfies a non-ground
annotated atom (rule) iff $I$ satisfies all ground instances of it. $I$ satisfies a GAP iff $I$
satisfies all rules in it. A GAP $\Pi$ entails an annotated atom $AA$, denoted $\Pi \models AA$, iff
every interpretation $I$ that satisfies $\Pi$ also satisfies $AA$.

As  shown  by  [Kifer  and  Subrahmanian  1992],  we  can  associate  a  fixpoint  operator  with 
any  GAP  II  that  maps  interpretations  to  interpretations. 

Definition 2.8. Suppose $\Pi$ is any GAP and $I$ an interpretation. The mapping $T_\Pi$
that maps interpretations to interpretations is defined as $T_\Pi(I)(A) = \sup\{\mu \mid A : \mu \leftarrow
AA_1 \wedge \ldots \wedge AA_n$ is a ground instance of a rule in $\Pi$ and for all $1 \le i \le n$, $I \models AA_i\}$.

[Kifer and Subrahmanian 1992] show that $T_\Pi$ is monotonic (w.r.t. $\preceq$) and has a least
fixpoint $lfp(T_\Pi)$. Moreover, they show that $\Pi$ entails $A : \mu$ iff $\mu \le lfp(T_\Pi)(A)$ and
hence $lfp(T_\Pi)$ precisely captures the ground atomic logical consequences of $\Pi$. They
also define the iteration of $T_\Pi$ as follows: $T_\Pi \uparrow 0$ is the interpretation that assigns 0 to
all ground atoms; $T_\Pi \uparrow (i + 1) = T_\Pi(T_\Pi \uparrow i)$.
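The iteration can be sketched in a few lines for ground programs. This is a minimal illustration under stated assumptions, not the implementation used in this paper: each rule carries at most one annotation variable $X$, its head annotation is a monotone function of $X$ (so the supremum over ground values of $X$ is reached at $X = I(\cdot)$), and the encoding of rules as tuples is hypothetical. Convergence is detected numerically with a tolerance, since some programs reach the least fixpoint only in the limit.

```python
def fact(atom, mu):
    """A fact  atom : mu <-  encoded as a rule with an empty body."""
    return (atom, lambda _x: mu, [], None)

def t_pi(rules, interp):
    """One application of T_Pi to a ground program.

    Each rule is (head, head_fn, const_body, var_atom):
      const_body : list of (atom, mu), satisfied iff I(atom) >= mu
      var_atom   : the body atom annotated with variable X, or None
      head_fn    : monotone function of X giving the head annotation
    """
    new = {}
    for head, head_fn, const_body, var_atom in rules:
        if all(interp.get(a, 0.0) >= mu for a, mu in const_body):
            x = interp.get(var_atom, 0.0) if var_atom is not None else 0.0
            # sup over all firing ground rules with the same head
            new[head] = max(new.get(head, 0.0), head_fn(x))
    return new

def lfp(rules, tol=1e-12, max_iter=10_000):
    interp = {}                      # T_Pi ^ 0: every ground atom maps to 0
    for _ in range(max_iter):
        nxt = t_pi(rules, interp)    # T_Pi ^ (i + 1) = T_Pi(T_Pi ^ i)
        keys = set(interp) | set(nxt)
        if all(abs(nxt.get(k, 0.0) - interp.get(k, 0.0)) <= tol for k in keys):
            return nxt
        interp = nxt
    return interp

rules = [
    fact(("adopter", "a"), 1.0),
    fact(("male", "a"), 1.0),
    fact(("IM", "a", "b"), 0.4),
    fact(("female", "b"), 1.0),
    fact(("will_adopt", "b"), 0.5),
    # Ground instance of rule (1) of Example 2.6 with V0 = a, V1 = b.
    (("will_adopt", "a"), lambda x: 0.8 * x + 0.2,
     [(("adopter", "a"), 1.0), (("male", "a"), 1.0),
      (("IM", "a", "b"), 0.3), (("female", "b"), 1.0)],
     ("will_adopt", "b")),
]
result = lfp(rules)
print(result[("will_adopt", "a")])   # close to 0.6, up to rounding
```

The first application of `t_pi` derives only the facts; the second fires the diffusion rule; the third changes nothing, so the iteration stops.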

The semantics of GAPs requires that when multiple ground instances
of GAP-rules with the same head “fire”, the highest annotation among these
ground rules is “chosen”. This might seem restrictive
and counter-intuitive to some, but it is actually the source of much of the power of GAPs.
For instance, one school of thought in probabilistic logic programming [Raedt et al.
2007] is that when multiple ground rules with the same head “fire”, the annotation
derived should be the “noisy-or” value obtained by combining the annotations
in the heads of the firing rules. However, this is just one way of combining evidence
from multiple sources$^7$: many triangular co-norms other than noisy-or can be
used and have been used in the literature [Bonissone 1987a]. Such co-norms
can be expressed in our framework. If we have ground rules $G_1, G_2, \ldots, G_n$, each having
the same atom in the head, and we want to combine evidence using a triangular
co-norm$^8$ $\oplus$, and if $G_i$ has the form

$A : \mu_i \leftarrow Body_i$

then we can replace these rules with the rules

$A : \oplus(\{\mu_i \mid i \in X\}) \leftarrow \bigwedge_{i \in X} Body_i$

for any subset $X \subseteq \{1, \ldots, n\}$. Moreover, as we have already remarked, many real-world
diffusion models are non-probabilistic, making assumptions about how annotations
should be combined harder to justify. Nevertheless, the above discussion shows that
the GAP framework is capable of expressing such combinations. Though there is clearly a
cost in terms of the difficulty of expressing such methods to combine evidence generated
by multiple rules, algorithms already exist and have been implemented ([Broecheler
et al. 2010]) to learn GAP-based diffusion rules automatically from social network time
series data.
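The rewriting above can be illustrated as follows. This sketch makes simplifying assumptions: rule bodies are reduced to sets of propositional atoms that are either true or false, noisy-or (probabilistic sum) is used as the example co-norm, the subsets $X$ are taken to be non-empty, and the function names are hypothetical.

```python
from functools import reduce
from itertools import combinations

def noisy_or(a, b):
    """Probabilistic sum, one example of a triangular co-norm."""
    return a + b - a * b

def combine_rules(rules, conorm):
    """rules: list of (mu_i, body_i) sharing one head atom.
    Returns one rule per non-empty subset X: the co-norm of the mu_i
    paired with the conjunction (here: union) of the bodies."""
    out = []
    for r in range(1, len(rules) + 1):
        for subset in combinations(rules, r):
            mu = reduce(conorm, (m for m, _ in subset))
            body = frozenset().union(*(b for _, b in subset))
            out.append((mu, body))
    return out

def head_value(combined, true_atoms):
    """GAP semantics: the highest annotation among the rules that fire."""
    fired = [mu for mu, body in combined if body <= true_atoms]
    return max(fired, default=0.0)

rules = [(0.5, frozenset({"b1"})), (0.4, frozenset({"b2"}))]
combined = combine_rules(rules, noisy_or)

print(head_value(combined, {"b1"}))        # only the first rule fires
print(head_value(combined, {"b1", "b2"}))  # noisy-or of 0.5 and 0.4
```

Because a co-norm never decreases when more evidence is added, taking the maximum over the rewritten rules returns the co-norm of exactly the original rules that fire. The price is visible in `combine_rules`: n original rules become 2^n - 1 rewritten rules.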

We will show (in Section 4) that many existing diffusion models for a variety of
phenomena can be expressed as a GAP $\Pi \supseteq \Pi_S$ by adding some GAP-rules describing
the diffusion process to $\Pi_S$.

3.  SOCIAL  NETWORK  DIFFUSION  OPTIMIZATION  PROBLEM  (SNDOP)  QUERIES 
3.1.  Basic  SNDOP  Queries 

In this section, we develop a formal syntax and semantics for optimization in social
networks, taking advantage of the aforementioned embedding of SNs into GAPs. In
particular, we formally define SNDOP queries, examples of which have been informally
introduced earlier as (Q1) and (Q2). We see from queries (Q1) and (Q2) that a
SNDOP query looks for a set $V'$ of vertices and has the following components: (i) an
objective function expressed via an aggregate operator, (ii) an integer k > 0, (iii) a set
of conditions that each vertex in $V'$ must satisfy, (iv) an “input” atom $g_I(V)$, and (v)
an “output” atom $g_O(V)$ (here $g_I$ and $g_O$ are vertex predicate symbols, whereas $V$ is a
variable).


7 Thus far, we have not come across any real-world diffusion models that use noisy-OR combinations or
indeed any triangular co-norm [Bonissone 1987a] other than the MAX used in this paper to combine values
generated by multiple rules having the same head. However, should such diffusion models come to light, it
may be appropriate to explore the use of languages such as ProbLog to see if we can “do better” for those
selected diffusion models.

8 When we apply $\oplus$ to a set $\{x_1, \ldots, x_k\}$, we use $\oplus(\{x_1, \ldots, x_k\})$ as short-hand for $\oplus(x_1, \oplus(\{x_2, \ldots, x_k\}))$,
which is well defined as all triangular co-norms are commutative and associative.
Aggregates. It is clear that in order to express queries like (Q1) and (Q2), we need
aggregate operators, which are mappings $agg : FM([0, 1]) \to \mathbb{R}^+$ ($\mathbb{R}^+$ is the set of non-negative
reals) where $FM(X)$ denotes the set of all finite multisets that are subsets of
$X$. Relational DB aggregates like SUM, COUNT, AVG, MIN, MAX are all aggregate operators
which can take a finite multiset of non-negative reals as input and return a single
non-negative real.

Vertex condition. A vertex condition is a set of vertex annotated atoms containing
exactly one variable (intuitively, such annotated atoms are conditions that must be
jointly satisfied by a vertex). More formally, a vertex condition $VC$ is a set $\{p_1(V) : \mu_1, \ldots,
p_n(V) : \mu_n\}$ where each $p_i \in VP$, $V \in \mathcal{V}$, and each $\mu_i \in [0, 1]$. We use $VC[V/v]$
to denote the set of ground annotated atoms obtained from $VC$ by replacing each occurrence
of $V$ with $v$, that is $VC[V/v] = \{p_1(v) : \mu_1, \ldots, p_n(v) : \mu_n\}$. A GAP $\Pi$ entails
$VC[V/v]$, denoted $\Pi \models VC[V/v]$, iff $\Pi \models p_i(v) : \mu_i$ for all $1 \le i \le n$.

Thus,  in  our  example,  {male{V)  :  1,  adopter(V )  :  1}  is  a  vertex  condition,  but 
{male{V)  :  1,  email(V,  V')  :  1}  is  not.  We  are  now  ready  to  define  a  SNDOP  query. 

Definition 3.1 (SNDOP query). A SNDOP query is a 5-tuple
$(agg, VC, k, g_I(V), g_O(V))$ where $agg$ is an aggregate, $VC$ is a vertex condition,
$k > 0$ is an integer, and $g_I(V)$, $g_O(V)$ are vertex atoms.

Let us consider again the medication distribution plan example. Suppose we have a
diffusion model expressing how a property healthy diffuses in a social network w.r.t.
a property immune (which would hold for a vertex when a medication is given to
it). An interesting query to pose would be to determine a set of at most $k$ people
such that if these people were immune to the disease, then the number of healthy
people would be maximized. Such a query can be expressed with the SNDOP query
$(SUM, \emptyset, k, immune(V), healthy(V))$. Here, the goal is to find a set $V' \subseteq V$ of vertices
such that $|V'| \le k$ and the following is maximized:

\[ SUM\{ lfp(T_{\Pi \cup \{immune(v') : 1 \mid v' \in V'\}})(healthy(v)) \mid v \in V \} \]

Here, the SUM is applied to a multiset rather than a set. Note that in the query above
$VC = \emptyset$, meaning that the immune property can be assigned to any vertex of the
SN. However, other queries can be expressed where $VC$ imposes restrictions on which
vertices can have property immune. As an example, $VC = \{adult(V)\}$ would require
every vertex in $V'$ to be an adult.

If we return to our cell phone example, we can set $agg = SUM$, $VC = \emptyset$, $k = 3$ (for
example), $g_I(V) = will\_adopt(V)$, and $g_O(V) = will\_adopt(V)$ (notice that in this case
$g_I(V) = g_O(V)$). Here also, the goal is to find a set $V' \subseteq V$ of vertices such that $|V'| \le 3$
and the following is maximized:

\[ SUM\{ lfp(T_{\Pi \cup \{will\_adopt(v') : 1 \mid v' \in V'\}})(will\_adopt(v)) \mid v \in V \} \]

Here, the SUM is applied to a multiset rather than a set. Note that the diffusion
model's impact is captured via the $lfp(T_{\Pi \cup \{will\_adopt(v') : 1 \mid v' \in V'\}})(will\_adopt(v))$
expression which, intuitively, tells us the confidence (according to the diffusion model) that
each vertex $v$ will be an adopter. If we return to an extended version of our cell phone
example and we want to ensure that the vertices in $V'$ are “good” customers 9 then we
can merely set $VC = \{good(V) : 1\}$. This query now asks us to find a set $V'$ of three
or fewer vertices, all of which are “good” customers of the company C, such that
$SUM\{ lfp(T_{\Pi \cup \{will\_adopt(v') : 1 \mid v' \in V'\}})(will\_adopt(v)) \mid v \in V \}$ is maximized.


9We can think of many ways a company may define “good” customers, e.g. those who regularly pay their bills
on time, those who buy a lot of services from the company, those who have stayed as customers for a long
time, etc. For our example, the specific definition of “good” is not relevant.

Our framework also allows the vertex condition $VC$ to have annotations other than
1. So in our cell phone example, the company could explicitly exclude anyone whose
“opinion” toward the company is negative. If opinion is quantified on a continuous $[0,1]$
scale (such automated systems do exist [Subrahmanian and Recupero 2008]), then the
vertex condition might be restated as $VC = \{good(V) : 1, negative\_opinion\_C(V) : 0.7\}$,
which says that the company wants to exclude anyone whose negativity about the
company exceeds 0.7 according to an opinion scoring engine such as [Subrahmanian
and Recupero 2008].

Definition 3.2 (pre-answer / value). Consider a SN $S = (V, E, \ell_{vert}, \ell_{edge}, w)$ embedded
in a GAP $\Pi$. A pre-answer to the SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$ w.r.t.
$\Pi$ is any set $V' \subseteq V$ such that:

(1) $|V'| \le k$, and

(2) for all vertices $v'' \in V'$, $\Pi \cup \{g_I(v') : 1 \mid v' \in V'\} \models VC[V/v'']$.

We use $pre\_ans(Q, \Pi)$ to denote the set of all pre-answers to $Q$ w.r.t. $\Pi$ (whenever $\Pi$ is
clear from the context we simply write $pre\_ans(Q)$).

The value of a pre-answer $V'$ is defined as follows:

\[ value(V') = agg(\{ lfp(T_{\Pi \cup \{g_I(v') : 1 \mid v' \in V'\}})(g_O(v)) \mid v \in V \}) \]

where the aggregate is applied to a multiset rather than a set. We also note that we
can define value as a mapping from interpretations to reals based on a SNDOP query.

We say $value(I) = agg(\{ I(g_O(v)) \mid v \in V \})$.

If we return to our cell phone example, $V'$ is the set of vertices to which the company
is considering giving free plans. $value(V')$ is computed as follows.

(1) Find the least fixpoint of $T_{\Pi'_{cell}}$, where $\Pi'_{cell}$ is $\Pi_{cell}$ expanded with facts of the form
$will\_adopt(v') : 1$ for each vertex $v' \in V'$.

(2) For each vertex $v \in V$ (the entire set of vertices, not just $V'$ now), we now find the
confidence assigned by the least fixpoint.

(3) Summing up these confidences gives us a measure of the expected number of plan
adoptees.
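
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the program is already ground, every rule body uses variable annotations, multiple derivations of an atom are combined with MAX, and the fixpoint is reached after finitely many iterations. The names `Rule`, `least_fixpoint`, `value`, and the three-vertex chain are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A ground rule: (head atom, body atoms, function from body annotations to head annotation).
Rule = Tuple[str, List[str], Callable[..., float]]


def least_fixpoint(rules: List[Rule], facts: Dict[str, float]) -> Dict[str, float]:
    """Apply a T_Pi-style operator repeatedly, starting from the facts, until stable.

    Several derivations of the same atom are combined with MAX.
    """
    interp = dict(facts)
    while True:
        new = dict(interp)
        for head, body, fn in rules:
            derived = fn(*(interp.get(atom, 0.0) for atom in body))
            new[head] = max(new.get(head, 0.0), derived)
        if all(abs(new[a] - interp.get(a, 0.0)) < 1e-12 for a in new):
            return new
        interp = new


def value(rules: List[Rule], vertices: Iterable[str], seeds: Iterable[str]) -> float:
    """SUM of the will_adopt confidences after seeding will_adopt(v') : 1 for v' in seeds."""
    facts = {f"will_adopt({v})": 1.0 for v in seeds}   # step (1): add the seed facts
    lfp = least_fixpoint(rules, facts)                  # step (1): least fixpoint
    return sum(lfp.get(f"will_adopt({v})", 0.0) for v in vertices)  # steps (2) and (3)


# Hypothetical three-vertex chain a -> b -> c; each hop halves the confidence.
half = lambda x: 0.5 * x
rules = [
    ("will_adopt(b)", ["will_adopt(a)"], half),
    ("will_adopt(c)", ["will_adopt(b)"], half),
]
total = value(rules, ["a", "b", "c"], ["a"])
print(total)
```

Seeding a different vertex set only changes the `seeds` argument; the rules and the vertex set stay fixed, which mirrors how $value$ depends on $V'$ alone once the query and the GAP are given.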

Definition 3.3 (answer). Suppose an SN $S = (V, E, \ell_{vert}, \ell_{edge}, w)$ is embedded in a
GAP $\Pi$ and $Q = (agg, VC, k, g_I(V), g_O(V))$ is a SNDOP query. A pre-answer $V'$ is an
answer to the SNDOP query $Q$ w.r.t. $\Pi$ iff the SNDOP query has no other pre-answer
$V''$ such that $value(V'') > value(V')$.10

The answer set to SNDOP query $Q$ w.r.t. $\Pi$, denoted $ans(Q, \Pi)$, is the set of all answers
to $Q$ w.r.t. $\Pi$ (whenever $\Pi$ is clear from the context we simply write $ans(Q)$).

It is important to note that an answer to an SNDOP query is a set of vertices that
jointly maximize the objective function specified. Thus, it is entirely possible that if
we set $k = 1$, we could have two answers $\{a_1\}$ and $\{a_2\}$, each of which ties for the
highest value. However, $\{a_1, a_2\}$ may not be the answer that optimizes the objective
function when $k = 2$.

Example 3.4. For instance, suppose $a_1$ and $a_2$ are brothers with largely the same
connections. The sets $\{a_1\}$ and $\{a_2\}$ both have value 100 each and let us say these
constitute an answer (looking at one individual only) w.r.t. an objective function, e.g.
influencing voters in an election to vote for candidate X. As $a_1$, $a_2$ mostly influence the


10Throughout this paper, we only treat maximization problems; there is no loss of generality in this because
minimizing an objective function $f$ is the same as maximizing $-f$.

Table I. Special cases of SNDOPs

Type                      Special Case       Reference
Special cases of $\Pi$    Linear GAP         Definition 3.6
Special cases of agg      Monotonic          Definition 3.7
                          Positive-linear    Definition 3.8
Special cases of value    Zero-starting      Definition 3.10
                          A-priori VC        Definition 3.12


same people, they may jointly be able to get only 110 people to vote for the candidate
because of the large overlap in their sphere of influence. However, now consider
persons $a_3$, $a_4$. Each of them can only influence 90 voters by themselves, but only 10 of
these voters “overlap”. Thus, they can jointly influence 80 + 80 + 10 = 170 voters to vote
for X. It would make more sense (all other things being equal) for the candidate's party
to invest in $\{a_3, a_4\}$.
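
The arithmetic in this example is an instance of inclusion-exclusion. Writing $A_i$ for the set of voters that person $a_i$ can influence (a symbol introduced here only for illustration), and reading the joint total of 110 as implying an overlap of 90 voters between the two brothers, we get:

\[ |A_1 \cup A_2| = 100 + 100 - 90 = 110, \qquad |A_3 \cup A_4| = 90 + 90 - 10 = 170. \]

The pair with the smaller individual values therefore has the larger joint value, which is why answers for $k = 2$ cannot in general be assembled from answers for $k = 1$.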

Example 3.5. Consider the GAP $\Pi_{cell}$ of Example 2.6 with the social network from
Figure 1 embedded and the SNDOP query $Q_{cell} = (SUM, \emptyset, 3, will\_adopt, will\_adopt)$.
The sets $V'_1 = \{v_{15}, v_{19}, v_6\}$ and $V'_2 = \{v_{15}, v_{18}, v_6\}$ are both pre-answers. In the case of
$V'_1$, two applications of the $T_\Pi$ operator yield a fixpoint where the vertex atoms formed
with will_adopt and vertices in the set $\{v_{15}, v_{19}, v_6, v_{12}, v_{18}, v_7, v_{10}\}$ are annotated with
1. For $V'_2$, only one application of $T_\Pi$ is required to reach a fixpoint. In the fixpoint,
vertex atoms formed with will_adopt and vertices in the set $\{v_{15}, v_6, v_{12}, v_{18}, v_7, v_{10}\}$ are
annotated with 1. As these are the only vertex atoms formed with will_adopt that have
a non-zero annotation after reaching the fixed point, we know that $value(V'_1) = 7$ and
$value(V'_2) = 6$.

3.2.  Special  Cases  of  SNDOPs 

In  this  section,  we  examine  several  special  cases  of  SNDOPs  that  still  allow  us  to 
represent  a  wide  variety  of  diffusion  models.  Table  I  illustrates  the  special  cases 
discussed  in  this  section  while  Table  II  illustrates  various  properties  we  prove  (and 
the  assumptions  under  which  those  properties  are  proved). 

Special Cases of GAPs. First, we present a class of GAPs called linear GAPs.
Intuitively, a GAP is linear if the annotations in the rule heads are linear functions and
the annotations in the body are variables. It is important to note that a wide variety
of diffusion models can be represented with GAPs that meet the requirements of this
special case. We formally define linear GAPs below.

Definition 3.6 (Linear GAP). A GAP-rule is linear iff it is of the form:

\[ H_0 : c_0 + c_1 \cdot X_1 + \cdots + c_n \cdot X_n \leftarrow A_1 : X_1 \wedge \ldots \wedge A_n : X_n \]

where each $c_i \in [0,1]$, $\sum_{j=1}^{n} c_j \in [0,1]$, and each $X_j$ is a variable in AVar. A GAP is linear
iff each rule in it is linear.

Note  that  linear  GAPs  allow  for  a  wide  variety  of  models  to  be  expressed.  Section  4 
will  show  that  several  well-known  network  diffusion  models  can  be  embedded  into  our 
framework.  Diffusion  Models  4.2  and  4.4,  reported  in  Section  4,  are  linear  GAPs  while 
Diffusion  Models  4.1  and  4.3  are  not. 

Special  Aggregates.  We  define  two  types  of  aggregates:  monotonic  aggregates  and 
positive-linear  aggregates. 

To define monotonicity, we first define a partial order $\sqsubseteq$ on multisets of numbers as
follows: given two multisets of numbers $X_1$ and $X_2$, we write $X_1 \sqsubseteq X_2$ iff there exists
an injective mapping $\beta : X_1 \to X_2$ such that $\forall x_1 \in X_1$, $x_1 \le \beta(x_1)$.

Definition 3.7 (Monotonic Aggregate). An aggregate $agg$ is monotonic iff whenever
$X_1 \sqsubseteq X_2$, it is the case that $agg(X_1) \le agg(X_2)$.

Definition 3.8 (Positive-Linear Aggregate). An aggregate $agg$ is positive-linear iff
it is defined as follows: $agg(X) = c_0 + \sum_{x_i \in X} c_i \cdot x_i$, where $X$ is a finite multiset and
$c_i > 0$ for all $i > 0$.

In the previous definition, note that $c_0$ can be positive, negative, or 0. Thus, we only
require that the coefficients associated with the elements of the multiset be positive;
we allow for an additive constant to be negative. One obvious example of a
positive-linear aggregate is SUM. Moreover, any positive weighted sum will also meet these
requirements.

PROPOSITION 3.9. If agg is a positive-linear aggregate, then it is a monotonic
aggregate.
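
To make the multiset order and the monotonicity property concrete, the following sketch tests whether an injective, value-increasing mapping between two multisets exists and evaluates one positive-linear aggregate. The function names are hypothetical, and the aggregate uses a single shared positive coefficient, which is an illustrative simplification of the general form.

```python
from typing import Callable, Sequence


def dominated(x1: Sequence[float], x2: Sequence[float]) -> bool:
    """True iff an injective map beta: x1 -> x2 with x <= beta(x) exists.

    Pairing the i-th largest element of x1 with the i-th largest element of x2
    succeeds exactly when some valid injective map exists.
    """
    if len(x1) > len(x2):
        return False
    a = sorted(x1, reverse=True)
    b = sorted(x2, reverse=True)
    return all(ai <= bi for ai, bi in zip(a, b))


def weighted_sum(c0: float, weight: float) -> Callable[[Sequence[float]], float]:
    """Positive-linear aggregate c0 + weight * sum(xs), with weight > 0."""
    if weight <= 0:
        raise ValueError("weight must be positive")
    return lambda xs: c0 + weight * sum(xs)


agg = weighted_sum(c0=-1.0, weight=2.0)   # a negative additive constant is allowed
x1, x2 = [0.2, 0.5], [0.6, 0.3, 0.1]
print(dominated(x1, x2), agg(x1) <= agg(x2))
```

Here `x1` is dominated by `x2`, and the aggregate value of `x1` does not exceed that of `x2`, as the proposition requires.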

Special cases of the query. We now describe two special cases of the query:
zero-starting and a-priori VC SNDOP queries. Intuitively, zero-starting means that
$value(\emptyset) = 0$.

Definition 3.10 (Zero-starting). An SNDOP query is zero-starting w.r.t. a given
social network $S$ and a GAP $\Pi \supseteq \Pi_S$ iff $value(\emptyset) = 0$.

Note that the function value is uniquely defined by a social network, a SNDOP query,
and a diffusion model $\Pi$, and hence the above definition is well defined.

The following result states that if an SNDOP query $Q$ with a positive-linear
aggregate is not zero-starting, then we can always modify it into an “equivalent” SNDOP
query $Q'$ (i.e. $ans(Q) = ans(Q')$) which is zero-starting and still maintains a
positive-linear aggregate.

PROPOSITION 3.11. Let $Q = (agg, VC, k, g_I(V), g_O(V))$ be a SNDOP query which
is not zero-starting w.r.t. a social network $S$ and a GAP $\Pi \supseteq \Pi_S$, and where agg is
positive-linear. Let $agg'(X) = agg(X) - value(\emptyset)$. Then, $Q' = (agg', VC, k, g_I(V), g_O(V))$
is a SNDOP query which is zero-starting w.r.t. $S$ and $\Pi$, $ans(Q) = ans(Q')$, and $agg'$ is
positive-linear.

Recall that in order to check if a set of vertices $V'$ is a pre-answer, we need to check
for all vertices $v'' \in V'$ if $\Pi \cup \{g_I(v') : 1 \mid v' \in V'\} \models VC[V/v'']$ (see condition (2) of
Definition 3.2). Intuitively, a SNDOP query has an A-Priori VC (w.r.t. a given social
network $S$ and a GAP $\Pi \supseteq \Pi_S$) when we can check this condition by looking only at the
original social network $S$ (thereby disregarding $\Pi$), that is, we can check for all vertices
$v'' \in V'$ if $\Pi_S \cup \{g_I(v'') : 1\} \models VC[V/v'']$. We formally define this notion below.

Definition 3.12 (A-Priori VC). A SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$ has
an A-Priori VC w.r.t. a given social network $S = (V, E, \ell_{vert}, \ell_{edge}, w)$ and a GAP
$\Pi \supseteq \Pi_S$ iff for each $V' \subseteq V$ the following holds: for each $v'' \in V'$,
$\Pi \cup \{g_I(v') : 1 \mid v' \in V'\} \models VC[V/v'']$ iff $\Pi_S \cup \{g_I(v'') : 1\} \models VC[V/v'']$.

If, in the cell phone example, we require that the free cell phones are given to “good”
vertices, then query (Q1) is a-priori VC because being “good” may be defined statically
and is not determined by the diffusion process. Likewise, if we consider our medical
example, in the case of an a-priori VC query (Q2) saying that an individual below 5
should not get the medicine, this boils down to a static labeling of each node's age
(below 5 or not) which is not affected by the diffusion process.

Table II. Properties that can be proven given certain assumptions

Property                                                  Assumptions
Monotonicity of value (Lemma 3.13)                        Monotonicity of agg
Multiset $\{V' \subseteq V \mid V'$ is a pre-answer$\}$ is a     A-priori VC
  uniform matroid (Lemma 3.14)
Submodularity of value (Theorem 3.15)                     Linear GAP
                                                          Positive-linear agg
                                                          A-priori VC

Table III. How the various properties are leveraged in the Algorithms

Algorithm                                                 Property
Exact algorithm with pruning (Section 5.2)                Monotonicity of value
Approx. Ratio on Greedy Algorithm (Section 5.3)           Submodularity
                                                          Zero-starting
                                                          Uniform matroid for the pre-answers

3.3.  Properties  of  SNDOPs 

In this section, we will prove several useful properties of SNDOPs that use various
combinations of the assumptions presented in the previous section. Later, we will
leverage some of these properties in our algorithms. Table II summarizes the different
properties that we prove in this section (as well as what assumptions we make
to prove these properties). Table III shows how these properties are leveraged in the
algorithms that we will present later in the paper.

We say that function value is monotonic iff $V_1 \subseteq V_2$ implies $value(V_1) \le value(V_2)$
for any two sets of vertices $V_1$ and $V_2$. The first property we show is that the value
function is monotonic if agg is monotonic.

LEMMA 3.13. Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social
network $S$, and a GAP $\Pi \supseteq \Pi_S$, if agg is monotonic (Definition 3.7), then value (defined as
per $Q$ and $\Pi$) is monotonic.

Before introducing the next result we recall the definitions of matroid and uniform
matroid. A matroid is a pair $(X, I)$ where $X$ is a finite set and $I$ is a collection of subsets
of $X$ (called “independent”), satisfying two axioms:

(1) $B \in I$, $A \subset B \Rightarrow A \in I$.

(2) $A, B \in I$, $|A| < |B| \Rightarrow \exists x \in B - A$ s.t. $A \cup \{x\} \in I$.

A uniform matroid is a matroid such that independent sets are all sets of size at most
$k$ for some $k \ge 1$.
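
On a small ground set, the two axioms can be verified by brute force. The sketch below builds the family of all subsets of size at most $k$ and checks both axioms directly; `uniform_family` and `is_matroid` are hypothetical helper names, and the exhaustive check is only practical for tiny examples.

```python
from itertools import combinations


def uniform_family(ground, k):
    """All subsets of `ground` with at most k elements, as frozensets."""
    return {frozenset(c) for r in range(k + 1) for c in combinations(ground, r)}


def is_matroid(family):
    """Brute-force check of the two matroid axioms on an explicit family of sets."""
    # Axiom (1): every subset of an independent set is independent.
    hereditary = all(
        frozenset(c) in family
        for b in family
        for r in range(len(b) + 1)
        for c in combinations(b, r)
    )
    # Axiom (2): a smaller independent set can be extended by an element of a larger one.
    exchange = all(
        any(a | {x} in family for x in b - a)
        for a in family
        for b in family
        if len(a) < len(b)
    )
    return hereditary and exchange


print(is_matroid(uniform_family("abcd", 2)))
```

A family such as $\{\emptyset, \{a\}, \{b\}, \{c\}, \{a, b\}\}$ passes the first axiom but fails the second, since $\{c\}$ cannot be extended from $\{a, b\}$.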

Next, we show that the set of pre-answers is a uniform matroid in the special case
of an a-priori VC query.

LEMMA 3.14. Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social
network $S$, and a GAP $\Pi \supseteq \Pi_S$, if $Q$ is a-priori VC w.r.t. $S$ and $\Pi$, then the set of pre-answers
is a uniform matroid.

As we will see in Section 5, the above lemma (along with other properties, see
Theorem 5.8) enables us to define a greedy approximation algorithm to solve SNDOP
queries that achieves the best possible approximation ratio that a polynomial
algorithm can achieve (unless P = NP).

An important property in social networks is submodularity, whose relationship to the
spread of phenomena in social networks has been extensively studied [Mossel and Roch
2007; Kleinberg 2008; Leskovec et al. 2007a]. If $X$ is a set, then a function $f : 2^X \to \mathbb{R}$
is submodular iff whenever $X_1 \subseteq X_2 \subseteq X$ and $x \in X - X_2$,
$f(X_1 \cup \{x\}) - f(X_1) \ge f(X_2 \cup \{x\}) - f(X_2)$. The following result states that the value
function associated with a linear GAP and an a-priori VC SNDOP query whose aggregate is
positive-linear is guaranteed to be submodular.

Fig. 2. Social network corresponding with Example 3.16 concerning disease spread.

THEOREM 3.15. Given an SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social
network $S$, and a GAP $\Pi \supseteq \Pi_S$, if the following criteria are met:

— $\Pi$ is a linear GAP,

— $Q$ is a-priori VC, and

— agg is positive-linear,

then value (defined as per $Q$ and $\Pi$) is submodular.

In other words, for $V_{cond} = \{v' \mid v' \in V$ and $(\Pi_S \cup \{g_I(v') : 1\} \models VC[V/v'])\}$, if
$V_1 \subseteq V_2 \subseteq V_{cond}$ and $v \in V_{cond} - V_2$, then the following holds:

\[ value(V_1 \cup \{v\}) - value(V_1) \ge value(V_2 \cup \{v\}) - value(V_2) \]

Proof Sketch: Consider a linear polynomial with a variable for each vertex in the set
of vertices that meet the a-priori VC, where setting the variable to 1 corresponds to the
vertex being picked and setting it to 0 indicates otherwise. For any subset of vertices
meeting the a-priori VC, there is an associated polynomial of this form such that when
the variables corresponding to the vertices are set to 1 (and the rest set to 0), the answer
is equal to the corresponding value for that set. For sets $V_1$, $V_2$ and a vertex $v$ (as per the
statement), we show that submodularity holds by manipulating such polynomials.
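
For small instances, the submodularity inequality can also be checked exhaustively. The sketch below does this for two hypothetical set functions that stand in for $value$: a coverage function, which satisfies the inequality, and a conjunctive function that rewards picking two specific vertices together, which violates it. Neither function is taken from the paper; they only illustrate the definition.

```python
from itertools import combinations


def powerset(items):
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def is_submodular(f, ground):
    """Check f(A + x) - f(A) >= f(B + x) - f(B) for all A within B and x outside B."""
    subsets = powerset(ground)
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue
            for x in set(ground) - b:
                if f(a | {x}) - f(a) < f(b | {x}) - f(b) - 1e-12:
                    return False
    return True


# Hypothetical reach sets: which vertices each seed ends up influencing.
reach = {"v1": {"v1", "v4"}, "v2": {"v2", "v4", "v5"}, "v3": {"v3", "v5"}}


def coverage(seeds):
    return len(set().union(*(reach[s] for s in seeds)))


def conjunctive(seeds):
    # One extra unit only when v1 and v2 are picked together (an AND-style effect).
    return len(seeds) + (1 if {"v1", "v2"} <= set(seeds) else 0)


print(is_submodular(coverage, reach), is_submodular(conjunctive, ["v1", "v2", "v3"]))
```

The conjunctive function fails because adding `v2` is worth more once `v1` is already present, which is the kind of increasing marginal gain that submodularity rules out.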

Example 3.16. We now show an example of a SNDOP query and a non-linear GAP
for which the value function is not submodular. Figure 2 shows a social network with
one edge predicate, e; all edges are weighted with 1. Nodes in the network are either
susceptible to the disease (circles) or carriers (diamonds); the associated predicates are
suc and car respectively. Additionally, we have the predicates inf and exp denoting vertices
that have been infected by or exposed to the disease. No vertex is initially exposed or
infected in the social network of Figure 2.

Let $\Pi_{disease}$ be the embedding of this network plus the following diffusion rules.

\[ exp(V) : 1 \leftarrow inf(V) : 1 \]

\[ exp(V) : 1 \leftarrow e(V', V) : 1 \wedge inf(V') : 1 \wedge suc(V) : 1 \]


inf{V)  : 


Hi 

-Hi  Ei. 


$\leftarrow exp(V) : 1 \wedge \bigwedge_{V_i \mid (V_i, V) \in E} (edge(V_i, V) : E_i \wedge inf(V_i) : I_i)$


Intuitively, the second rule says that a vertex becomes exposed if that vertex is
susceptible and it has at least one incoming neighbor that is infected. The third rule states
that a vertex becomes infected if it is exposed and all its incoming neighbors are
infected.

Consider the function value defined as per the SNDOP query
$(SUM, \emptyset, 2, inf(V), inf(V))$ and $\Pi_{disease}$. Obviously, as the GAP is not linear, it
does not meet the requirements of Theorem 3.15 to prove submodularity. We can
actually show through a counterexample that this SNDOP query is not submodular.
Consider the following:

\[ value(\{v_1, v_5\}) - value(\{v_1\}) = 1 \]

(here $value(\{v_1, v_5\}) = 2$ and $value(\{v_1\}) = 1$) and

\[ value(\{v_1, v_7, v_5\}) - value(\{v_1, v_7\}) = 4 \]

(here $value(\{v_1, v_7, v_5\}) = 7$ and $value(\{v_1, v_7\}) = 3$).

Since $\{v_1\} \subseteq \{v_1, v_7\}$ and adding $v_5$ to the larger set yields the larger gain (4 versus 1),
this shows a clear violation of submodularity.

As an example of how the values above are determined, consider $value(\{v_1, v_5\})$.
Notice that the third rule of $\Pi_{disease}$ is the only one that can be used to propagate the
inf property, but in order for a vertex $V$ to get infected using this rule, $V$ has to be
exposed first (and all its incoming neighbors have to be infected). When $v_1$ and $v_5$ are
assumed  to  be  infected,  tq  gets  exposed  (vi  and  v5  get  exposed  as  well  because  of  the 
first  rule).  At  this  point,  the  exposed  property  cannot  be  propagated  any  further,  and 
no  vertex  can  get  infected  because  no  vertex  is  both  exposed  and  has  all  its  incoming 
neighbors  infected  (notice  that  tq  cannot  get  infected  because  v6  is  not  infected).  Thus, 
$value(\{v_1, v_5\}) = 2$.

3.4.  The  Complexity  of  SNDOP  Queries 

We now study the complexity of answering an SNDOP query. First, we show that
SNDOP query answering is NP-hard by a reduction from max k-cover [Feige 1998].
We show that the problem is NP-hard even when many of the special cases hold.

THEOREM 3.17. Finding an answer to an SNDOP query
$Q = (agg, VC, k, g_I(V), g_O(V))$ (w.r.t. a social network $S$ and a GAP $\Pi \supseteq \Pi_S$) is NP-hard
(even if $\Pi$ is a linear GAP, $VC = \emptyset$, $agg = SUM$ and value is zero-starting).

Proof Sketch: The known NP-hard problem of MAX-K-COVER [Feige 1998] is defined
as follows.

INPUT: Set of elements $S$, a family of subsets of $S$, $H = \{H_1, \ldots, H_{max}\}$, and
positive integer $K$.

OUTPUT: At most $K$ subsets from $H$ such that the union of the subsets
covers a maximal number of elements in $S$.

We show that MAX-K-COVER can be embedded into a social network and that
the corresponding SNDOP query gives an optimal answer to MAX-K-COVER. The
embedding is done by creating a social network resembling a bipartite graph, where
vertices represent either the elements or the subsets from the input of MAX-K-COVER.
For every vertex pair representing a set and an element of that set, there is an edge
from the set vertex to the element vertex. A single vertex predicate and a single edge
predicate are used: vertex and edge. A single non-ground diffusion rule is added to the GAP:
$vertex(V) : X \leftarrow vertex(V') : X \wedge edge(V', V) : 1$. The aggregate is simply the sum
of the annotations associated with the vertex atoms. We show that the picked vertices
that maximize the aggregate correspond with picked subsets that maximize output
of the problem. Also, as we do not use VC, the GAP is linear, and the aggregate is
positive-linear, we know that the value function is submodular.
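
The embedding in the proof sketch can be tried out on a toy MAX-K-COVER instance. The following sketch builds the bipartite network, propagates annotations with the single diffusion rule, and finds the best seed set by exhaustive search. It assumes that no vertex atom has a non-zero annotation before seeding and that both $g_I$ and $g_O$ are the vertex predicate; the function names and the instance are hypothetical.

```python
from itertools import combinations


def build_network(subsets):
    """Edges from each set vertex to the element vertices it contains."""
    return [(name, elem) for name, members in subsets.items() for elem in members]


def value(edges, vertices, seeds):
    """SUM of annotations in the least fixpoint of
    vertex(V) : X <- vertex(V') : X AND edge(V', V) : 1, seeded with vertex(v) : 1."""
    ann = {v: (1.0 if v in seeds else 0.0) for v in vertices}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if ann[src] > ann[dst]:
                ann[dst] = ann[src]   # the head takes the annotation of the body atom
                changed = True
    return sum(ann.values())


def best_seed_set(subsets, k):
    """Exhaustive search over all vertex sets of size at most k."""
    edges = build_network(subsets)
    vertices = set(subsets) | {e for members in subsets.values() for e in members}
    candidates = (c for r in range(k + 1) for c in combinations(sorted(vertices), r))
    best = max(candidates, key=lambda c: value(edges, vertices, set(c)))
    return set(best), value(edges, vertices, set(best))


# Hypothetical instance: three subsets over the elements a..f, K = 2.
subsets = {"H1": {"a", "b", "c"}, "H2": {"c", "d"}, "H3": {"d", "e", "f"}}
chosen, total = best_seed_set(subsets, 2)
print(chosen, total)
```

In this instance the best seed set consists of two set vertices, and its value equals the number of picked sets plus the number of elements they cover, so maximizing the value also maximizes the coverage.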

Under  some  conditions,  the  decision  problem  for  SNDOP  queries  is  also  in  NP. 

THEOREM 3.18. Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social
network $S$, a GAP $\Pi \supseteq \Pi_S$, and a real number target, the problem of checking whether
there exists a pre-answer $V'$ s.t. $value(V') > target$ is in NP under the assumptions that
agg and the functions in $\mathcal{F}$ are polynomially computable, and $\Pi$ is ground.

Most common aggregate functions like SUM, AVERAGE, Weighted average, MIN,
MAX, COUNT are all polynomially computable. Moreover, the assumption that the
functions corresponding to the function symbols in $\mathcal{F}$ (i.e. the function symbols that can
appear in the annotations of a GAP) are polynomially computable is also reasonable.

Later  in  this  paper,  we  shall  address  the  problem  of  answering  a  SNDOP  query 
using  an  approximation  algorithm.  We  re-state  the  definition  of  approximation  below 
(see  [Garey  and  Johnson  1979]). 

Definition 3.19 (Approximation). Consider a maximization problem and let $OPT(I)$
denote the value of an optimal solution for an instance $I$ of the problem. An
$\alpha$-approximation algorithm $A$ is an algorithm that for any instance $I$ finds a candidate
solution such that

\[ OPT(I) \le \alpha \cdot A(I) \]

where $A(I)$ denotes the value of the solution found by $A$ for instance $I$.

Based on the above definition, we shall say that $V'$ is a $\frac{1}{\alpha}$-approximation to an
SNDOP query if $value(V_{opt}) \le \alpha \cdot value(V')$ (where $V_{opt}$ is an answer to the SNDOP
query). Likewise, the algorithm that produces $V'$ in this case is an $\alpha$-approximation
algorithm. We note that due to the nature of the reduction from MAX-K-COVER that
we used to prove NP-hardness, we can now show that unless P = NP, there is no
PTIME approximation of the SNDOP query answering problem that can guarantee
that the approximate answer is better than 0.63 of the optimal value. This gives us
an idea of the limits of approximation possible for a SNDOP query with a
polynomial-time algorithm. Later, we will develop a greedy algorithm that precisely matches this
approximation ratio.

THEOREM 3.20. Answering a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$ (w.r.t. a
social network $S$ and a GAP $\Pi \supseteq \Pi_S$) cannot be approximated in PTIME within a ratio
of $\frac{e-1}{e} + \epsilon$ for some $\epsilon > 0$ (where $e$ is the base of the natural logarithm) unless P = NP, even
if $\Pi$ is a linear GAP, $VC = \emptyset$, $agg = SUM$ and value is zero-starting.

In other words, the previous theorem says that there is no polynomial-time
algorithm that can approximate value within a factor better than about 0.63 under standard
assumptions.
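
The constant 0.63 is $1 - 1/e \approx 0.632$. As a point of reference, the sketch below runs the standard greedy rule for coverage (repeatedly pick the set with the largest marginal gain) on a hypothetical instance and compares it with the exhaustive optimum. It is a generic illustration of the ratio, not the greedy algorithm developed later in the paper, and all names and data are made up for the example.

```python
import math
from itertools import combinations


def coverage(sets, chosen):
    """Number of elements covered by the chosen set names."""
    return len(set().union(*(sets[name] for name in chosen)))


def greedy(sets, k):
    """Pick up to k sets, each time taking the one with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        remaining = [n for n in sorted(sets) if n not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda n: coverage(sets, chosen + [n]))
        if coverage(sets, chosen + [best]) == coverage(sets, chosen):
            break   # no remaining set adds anything
        chosen.append(best)
    return chosen


def optimum(sets, k):
    """Best coverage over all choices of at most k sets (exhaustive)."""
    return max(coverage(sets, c) for r in range(k + 1) for c in combinations(sets, r))


# Hypothetical instance on which the greedy rule is not optimal.
sets = {"A": {1, 2, 3, 4}, "B": {1, 2, 5}, "C": {3, 4, 6}}
greedy_value = coverage(sets, greedy(sets, 2))
best_value = optimum(sets, 2)
print(greedy_value, best_value, greedy_value >= (1 - 1 / math.e) * best_value)
```

Here the greedy choice covers 5 elements while the optimum covers 6, which is still well above the $1 - 1/e$ fraction of the optimum.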

3.5.  Counting  Complexity  of  SNDOP-Queries 

In this section, we ask the question: how many answers are there to a SNDOP query
$(agg, VC, k, g_I(V), g_O(V))$? In the case of the cell phone example, this corresponds to
asking “How many sets ANS of people are there in the network such that ANS
has $k$ or fewer people and ANS optimizes the aggregate, subject to the vertex condition
VC?” If there are $m$ such sets $ANS_1, \ldots, ANS_m$, this means the cell phone company can
give the free cell phone plan to either all members of $ANS_1$ or to all members of $ANS_2$,
and so forth. The “counting complexity” problem of determining $m$ is #P-complete.

THEOREM 3.21. Counting the number of answers to a SNDOP query $Q$ (w.r.t. a
social network $S$ and a GAP $\Pi \supseteq \Pi_S$) is #P-complete.

3.6.  The  SNDOP-ALL  Problem 

Suppose the cell phone company wants to identify all the “most influential” users, that
is, the users that when considered individually (and not as a set) yield a maximum
expected number of plan adoptees. This might be computed by taking the union of all
the answers to query (Q1) with $k = 1$. For instance, if we consider the hypothetical
example of the political candidate (Example 3.4), the candidate may also want to know
all the top influencers when considered individually. In this case, vertices $a_1$, $a_2$ would
emerge in the answer to a SNDOP-ALL query defined below.

Although the counting version of the query is #P-hard, finding the union of all
answers to a SNDOP query is no harder than a SNDOP query (w.r.t. polynomial-time
Turing reductions). We shall refer to this problem as SNDOP-ALL; it reduces
both to and from a regular SNDOP query (w.r.t. polynomial-time Turing reductions).

We start with the following result, showing that we can answer a SNDOP query in
PTIME with an oracle to SNDOP-ALL.

THEOREM 3.22. Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social
network $S$, and a GAP $\Pi \supseteq \Pi_S$, there exists a polynomial-time algorithm with an oracle
to SNDOP-ALL which answers $Q$.

Proof Sketch: We embed a SNDOP query in a SNDOP-ALL query via the following
informal algorithm (FIND-SET) that takes an instance of SNDOP-ALL ($Q$) and some
vertex set $V^*$, $|V^*| \le k$.

(1) If $|V^*| = k$, return $V^*$.

(2) Else, solve SNDOP-ALL($V^*$), returning set $V''$.

(a) If $V'' - V^* = \emptyset$, return $V^*$.

(b) Else, pick $v \in V'' - V^*$ and return FIND-SET($Q$, $V^* \cup \{v\}$).
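
A possible reading of FIND-SET is sketched below. The proof sketch is informal, so the oracle semantics are an assumption made here: SNDOP-ALL($V^*$) is taken to return the union of all best sets of size at most $k$ that contain $V^*$. The oracle is simulated by brute force, the value function is a hypothetical coverage count, and vertex conditions are ignored.

```python
from itertools import combinations


def sndop_all_oracle(value, vertices, k, fixed):
    """Stand-in oracle (brute force): union of all best sets of size <= k
    that contain `fixed`. This reading of SNDOP-ALL(V*) is an assumption."""
    rest = sorted(set(vertices) - fixed)
    candidates = [
        fixed | frozenset(c)
        for r in range(k - len(fixed) + 1)
        for c in combinations(rest, r)
    ]
    best = max(value(c) for c in candidates)
    return frozenset().union(*(c for c in candidates if value(c) == best))


def find_set(value, vertices, k, fixed=frozenset()):
    """Grow `fixed` one vertex at a time, always staying inside some best set."""
    if len(fixed) == k:                       # step (1)
        return fixed
    union = sndop_all_oracle(value, vertices, k, fixed)   # step (2)
    extra = sorted(union - fixed)
    if not extra:                             # step (2a)
        return fixed
    return find_set(value, vertices, k, fixed | {extra[0]})   # step (2b)


# Hypothetical value function: number of distinct vertices reached by the seeds.
reach = {"a": {"a", "x", "y"}, "b": {"b", "x"}, "c": {"c", "z", "w"}}


def coverage(seeds):
    return len(set().union(*(reach[s] for s in seeds)))


answer = find_set(coverage, reach, 2)
print(sorted(answer))
```

Each recursive call adds one vertex, so the oracle is consulted at most $k$ times.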

The theorem below shows that SNDOP-ALL can be answered in PTIME with an
oracle to a SNDOP query.

THEOREM 3.23. Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social
network $S$, and a GAP $\Pi \supseteq \Pi_S$, finding $\bigcup_{V' \in ans(Q)} V'$ reduces to $|V| + 1$ SNDOP queries,
where $V$ is the set of vertices of $S$.

Proof Sketch: Using an oracle that correctly answers SNDOP queries, we can answer a SNDOP-ALL query by setting up $|V|$ SNDOP queries as follows:

— Let $k_{all}$ be the $k$ value for the SNDOP-ALL query and, for each SNDOP query $i$, let $k_i$ be the $k$ for that query. For each query $i$, set $k_i = k_{all} - 1$.

— Number each element $v_i \in V$ such that $g_I(v_i)$ and $VC(v_i)$ are true. For the $i$th SNDOP query, let $v_i$ be the corresponding element of $V$.

— Let $\Pi_i$ refer to the GAP associated with the $i$th SNDOP query and $\Pi_{all}$ be the program for SNDOP-ALL. For each program $\Pi_i$, add the fact $g_I(v_i) : 1$.

— For each SNDOP query $i$, the remainder of the input is the same as for SNDOP-ALL.

After the construction, do the following:

(1) We shall refer to a SNDOP query that has the same input as SNDOP-ALL as the “primary query.” Let $V_{ans}^{(pri)}$ be an answer to this query and $value(V_{ans}^{(pri)})$ be the associated value.

(2) For each SNDOP query $i$, let $V_{ans}^{(i)}$ be an answer and $value(V_{ans}^{(i)})$ be the associated value.

(3) Let $V'$, the solution to SNDOP-ALL, be initialized as $\emptyset$.

(4) For each SNDOP query $i$, if $value(V_{ans}^{(i)}) = value(V_{ans}^{(pri)})$, then add vertex $v_i$ to $V'$.
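As a rough illustration of this construction, the Python sketch below issues one primary query and one query per vertex, and keeps the vertices whose forced inclusion still attains the primary value. The oracle `sndop_value(forced, k)` and the additive toy objective in `toy_sndop_value` are assumptions made for the example; the paper's oracle returns an answer set together with its value.

```python
from itertools import combinations

def sndop_all_via_sndop(vertices, k_all, sndop_value):
    """Collect every vertex that occurs in some optimal answer, using |V| + 1 oracle calls."""
    primary = sndop_value(frozenset(), k_all)            # the "primary query"
    members = set()                                       # step (3): V' starts empty
    for v in vertices:                                    # one SNDOP query per vertex v_i
        # query i: add the fact g_I(v_i) : 1 and lower the budget to k_all - 1
        if sndop_value(frozenset({v}), k_all - 1) == primary:
            members.add(v)                                # step (4)
    return members


def toy_sndop_value(weights):
    """Hypothetical oracle: best total weight of `forced` plus at most k further vertices."""
    def value(forced, k):
        free = [v for v in weights if v not in forced]
        return max(sum(weights[v] for v in forced | set(c))
                   for r in range(k + 1)
                   for c in combinations(free, r))
    return value


weights = {"a": 3, "b": 3, "c": 1}
result = sndop_all_via_sndop(weights, 2, toy_sndop_value(weights))
```

In this toy instance only `a` and `b` belong to an optimal pair, so forcing `c` lowers the attainable value and `c` is left out.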
4. APPLYING SNDOPS TO REAL DIFFUSION PROBLEMS

In this section, we show how SNDOPs can be applied to real-world diffusion problems. Most diffusion models in the literature fall into one of three categories: tipping models (Section 4.1), where a given vertex adopts a behavior based on the ratio of how many of its neighbors previously adopted the behavior; cascade models (Section 4.2), where a property passes from vertex to vertex solely based on the strength of the relationship between the vertices; and homophilic models (Section 4.3), where vertices with similar properties tend to adopt the same behavior, irrespective of (or in addition to) network relationships. None of these approaches solves SNDOP queries; they merely specify diffusion models rather than performing the kinds of optimizations that we perform in SNDOP queries.

4.1.  Tipping  Diffusion  Models 

Tipping models [Centola 2010; Schelling 1978; Granovetter 1978] have been studied extensively in economics and sociology to understand diffusion phenomena. In tipping models, a vertex changes a property based on the cumulative effect of its neighbors. In this section, we present the tipping model of Jackson-Yariv [Jackson and Yariv 2005], which generalizes many existing tipping models.

The Jackson-Yariv Diffusion Model [Jackson and Yariv 2005]. In this framework, the social network is just an undirected graph $G' = (V', E')$ consisting of a set of agents (e.g. people). Each agent has a default behavior ($A$) and a new behavior ($B$). Suppose $d_i$ denotes the degree of a vertex $v_i$. [Jackson and Yariv 2005] use a function $\gamma : \{0, \ldots, |V'| - 1\} \rightarrow [0, 1]$ to describe how the number of neighbors of $v$ affects the benefits to $v$ for adopting behavior $B$. For instance, $\gamma(3)$ specifies the benefits (in adopting behavior $B$) that accrue to an arbitrary vertex $v \in V'$ that has three neighbors. Let $\pi_i$ denote the fraction of neighbors of $v_i$ that have adopted behavior $B$. Let constants $b_i$ and $p_i$ be the agent-specific benefit and cost, respectively, for vertex $v_i$ to adopt behavior $B$. [Jackson and Yariv 2005] state that node $v_i$ switches to behavior $B$ iff $\frac{b_i}{p_i} \cdot \gamma(d_i) \cdot \pi_i \geq 1$.

Returning  to  our  cell-phone  example,  one  could  potentially  use  this  model  to  describe 
the  spread  of  the  new  plan.  In  this  case,  behavior  A  would  be  adherence  to  the  current 
plan  the  user  subscribes  to,  while  B  would  be  the  use  of  the  new  plan.  The  associated 
SNDOP  query  would  find  a  set  of  nodes  which,  if  given  a  free  plan,  would  jointly  maxi¬ 
mize  the  expected  number  of  adoptees  of  the  plan.  Cost  and  benefit  could  be  computed 
from  factors  such  as  income,  time  invested  in  switching  plans,  etc.  We  show  how  the 
model  of  [Jackson  and  Yariv  2005]  can  be  expressed  via  GAPs. 

Diffusion Model 4.1 (Jackson-Yariv model). Given a Jackson-Yariv model consisting of $G' = (V', E')$, we can set up a social network $S = (V', E'', \ell_{vert}, \ell_{edge}, w)$ as follows. We define $E'' = \{(x, y), (y, x) \mid (x, y) \in E'\}$. We have a single edge predicate symbol $edge$ which is assigned by $\ell_{edge}$ to every edge in $E''$, and $w$ assigns 1 to all edges in $E''$. Our associated GAP $\Pi_{JY}$ now consists of $\Pi_S$ plus one rule of the following form for each vertex $v_i$:

$$B(v_i) : \left\lfloor \frac{b_i}{p_i} \cdot \gamma\Big(\sum_j E_j\Big) \cdot \frac{\sum_j X_j}{\sum_j E_j} \right\rfloor \leftarrow \bigwedge_{v_j \mid (v_j, v_i) \in E''} \big( edge(v_j, v_i) : E_j \wedge B(v_j) : X_j \big)$$

It is easy to see that this rule (when applied in conjunction with $\Pi_S$ for a social network $S$) precisely encodes the Jackson-Yariv semantics.
Fig. 3. Social network of individuals sharing photographs.

We  notice  right  away  that  the  above  GAP  is  not  linear.  However,  if  we  eliminate  the 
floor  function  and  impose  certain  restrictions  on  the  coefficients  appearing  in  the  head 
of  the  rules,  then  we  obtain  a  linear  GAP  that  represents  a  variant  of  this  model  where 
the  annotation  would  represent  a  “confidence”  that  an  agent  adopts  behavior  B.  The 
idea  of  the  confidence  of  an  agent  represented  by  a  vertex  adopting  a  certain  behavior 
as  a  function  of  his  adopting  neighbors  is  suggested  in  the  experiment  of  [Centola  2010] 
where  he  observed  that  the  preference  for  a  new  behavior  increases  monotonically  with 
the  number  of  incoming  neighbors  who  previously  adopted  said  behavior.  Such  a  linear 
GAP  model  is  presented  below. 

Diffusion Model 4.2 (Linear Jackson-Yariv model). For each vertex $v_i$, let

$$c_i = \frac{b_i}{p_i} \cdot \gamma(d_i) \cdot \frac{1}{\sum_{v_j \mid (v_j, v_i) \in E''} w((v_j, v_i))}$$

If for each vertex $v_i$, $c_i \in [0, 1]$ and $|\{v_j \mid (v_j, v_i) \in E''\}| \times c_i \leq 1$, then we can derive a linear GAP for the Jackson-Yariv model that consists of one rule of the following form for each vertex $v_i$:

$$B(v_i) : c_i \cdot \sum_j X_j \leftarrow \bigwedge_{v_j \mid (v_j, v_i) \in E''} B(v_j) : X_j$$

Notice that the above rule is similar to the one in Diffusion Model 4.1, but the floor function has been dropped and restrictions on the $c_i$'s are imposed to make the rule linear.


Example 4.1. Figure 3 illustrates a social network of individuals who share photographs. Each edge is labeled with the predicate $share$ and has weight 1. The only vertex predicate we consider in this case is $buys\_camera$.

A vendor wishes to sell a camera and wants to see how the popularity of the camera will spread in the network. He wants to use a Jackson-Yariv style diffusion model. Suppose the social network is embedded into GAP $\Pi$ which has one Jackson-Yariv style tipping diffusion rule of the following form for each vertex $v$:

$$buys\_camera(v) : \left\lfloor \frac{\sum_j X_j}{\sum_j E_j} \right\rfloor \leftarrow \bigwedge_{v_j \mid (v_j, v) \in E} \big( share(v_j, v) : E_j \wedge buys\_camera(v_j) : X_j \big)$$

We will call the GAP with the above diffusion rule $\Pi_{standard}$. Alternatively, we could have a linear version of it as follows:

$$buys\_camera(v) : \frac{\sum_j X_j}{|\{v_j \mid (v_j, v) \in E\}|} \leftarrow \bigwedge_{v_j \mid (v_j, v) \in E} buys\_camera(v_j) : X_j$$

We will call the GAP formed with the previous kind of diffusion rules $\Pi_{lin}$. In this case, it is clear that each rule head is annotated with the linear expression:

$$c_0 + c_1 \cdot X_1 + \ldots + c_{|\{v_j \mid (v_j, v) \in E\}|} \cdot X_{|\{v_j \mid (v_j, v) \in E\}|}$$

Here, $c_0 = 0$ and for all $i > 0$ we have

$$c_i = \frac{1}{|\{v_j \mid (v_j, v) \in E\}|}$$

Clearly, each $c_i \in [0, 1]$ and the sum of all these constants is 1, which gives us linearity in accordance with Definition 3.6. Table IV shows the least fixed point for the two different GAPs (original JY model and the linear version) that arise when we assign an annotation of 1 to vertex atom $buys\_camera(v_2)$; it also shows the sum of the annotations.

Table IV. Comparison between standard and linear Jackson-Yariv Models

Each column gives the annotation assigned by the least fixed point $lfp(T_{\Pi' \cup \{buys\_camera(v_2) : 1\}})$, with $\Pi' = \Pi_{standard}$ (standard) or $\Pi' = \Pi_{lin}$ (linear).

Vertex atom             Standard    Linear
buys_camera(v_1)        0.0         0.5
buys_camera(v_2)        1.0         1.0
buys_camera(v_3)        1.0         1.0
buys_camera(v_4)        0.0         0.0
buys_camera(v_5)        0.0         0.0
buys_camera(v_6)        0.0         0.0
buys_camera(v_7)        0.0         0.25
buys_camera(v_8)        0.0         0.5
buys_camera(v_9)        0.0         0.5
buys_camera(v_10)       0.0         0.5
SUM                     2           4.25
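To make the difference between the two rule styles concrete, the following Python sketch iterates both rules to a fixed point on a small graph. The graph, the vertex names and the function `tipping_lfp` are hypothetical (the network of Figure 3 is not reproduced here), all edge weights are taken to be 1, and exact fractions are used so that the floor is not disturbed by rounding.

```python
from fractions import Fraction
from math import floor

def tipping_lfp(in_neighbors, seeds, linear, max_iter=1000):
    """Iterate the tipping rule from the seed annotations until nothing changes."""
    ann = {v: Fraction(0) for v in in_neighbors}
    ann.update({v: Fraction(a) for v, a in seeds.items()})
    for _ in range(max_iter):
        new = dict(ann)
        for v, nbrs in in_neighbors.items():
            if nbrs:
                # fraction of incoming neighbours that adopted, weighted by confidence
                frac = sum(ann[u] for u in nbrs) / len(nbrs)
                derived = frac if linear else Fraction(floor(frac))
                new[v] = max(new[v], derived)   # annotations never decrease
        if new == ann:
            return new
        ann = new
    raise RuntimeError("no fixed point reached within max_iter iterations")


# Hypothetical directed, acyclic network: v maps to its incoming neighbours.
in_neighbors = {"v1": [], "v2": ["v1"], "v3": ["v1", "v4"], "v4": [], "v5": ["v3"]}
standard = tipping_lfp(in_neighbors, {"v1": 1}, linear=False)
linear = tipping_lfp(in_neighbors, {"v1": 1}, linear=True)
```

As in Table IV, the floor in the standard rule suppresses partial adoption (here `v3` and `v5` stay at 0), while the linear rule propagates a confidence of 1/2 to both, so the linear sum is larger.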


4.2.  Cascading  Diffusion  Models 

In  a  cascading  model,  a  vertex  obtains  a  property  from  one  of  its  neighbors,  typically 
based  on  the  strength  of  its  relationship  with  the  neighbor.  These  models  were  intro¬ 
duced  in  the  epidemiology  literature  in  the  early  20th  century,  but  gained  increased 
notice  with  the  seminal  work  of  [Anderson  and  May  1979].  Recently,  cascading 
diffusion models have been applied to other domains as well. For example, [Cha et al. 2008] (diffusion of photos in Flickr) and [Sun et al. 2009] (diffusion of bookmarks in Facebook) both look at diffusion processes in social networks as “social cascades” of this type. In this section, we present an encoding of the popular SIR model of disease spread in our framework.

The SIR Model of Disease Spread. The SIR (susceptible, infectious, removed) model of disease spread [Anderson and May 1979] is a classic disease model which labels each vertex in a graph $G = (V, E)$ (of humans) with susceptible if it has not had the disease but can receive it from one of its neighbors, infectious if it has caught the disease and $t_{rec}$ units of time have not expired, and removed when the vertex can no longer catch or transmit the disease. The SIR model assumes that a vertex $v$ that is infected can transmit the disease to any of its neighbors $v'$ with a probability $p_{v,v'}$ for $t_{rec}$ units of time. It is assumed that becoming infected takes precisely a time unit. We would like to find a set of at most $k$ vertices that would maximize the expected number of vertices that become infected. These are obviously good candidates to treat with appropriate medications.

Diffusion Model 4.3 (SIR model). Let $S = (V, E, \ell_{vert}, \ell_{edge}, w)$ be an SN where each edge is labeled with the predicate symbol $e$ and $w((v, v')) = p_{v,v'}$ assigns a probability of transmission to each edge. We use the predicate $inf$ to designate the initially infected vertices. In order to create a GAP $\Pi_{SIR}$ capturing the SIR model of disease spread, we first define $t_{rec}$ predicate symbols $rec_1, \ldots, rec_{t_{rec}}$ where $rec_i(v)$ intuitively means that node $v$ was infected $i$ days ago. Hence, $rec_{t_{rec}}(v)$ means that $v$ is “removed.” We embed $S$ into GAP $\Pi_{SIR}$ by adding the following diffusion rules. If $t_{rec} > 1$, we add a non-ground rule for each $i \in \{2, \ldots, t_{rec}\}$, starting with $t_{rec}$:

$$rec_i(V) : R \leftarrow rec_{i-1}(V) : R$$

$$rec_1(V) : R \leftarrow inf(V) : R$$

$$inf(V) : (1 - R) \cdot P_{V',V} \cdot P_{V'} \cdot (1 - R') \leftarrow rec_{t_{rec}}(V) : R \wedge e(V', V) : P_{V',V} \wedge inf(V') : P_{V'} \wedge rec_{t_{rec}}(V') : R'$$

The first rule says that if a vertex is in its $(i-1)$'th day of recovery with confidence $R$ in the $j$'th iteration of the $T_{\Pi_{SIR}}$ operator, then the vertex is $i$ days into recovery (with the same confidence) in the $(j+1)$'th iteration of the operator. Likewise, the second rule intuitively encodes the fact that if a vertex became infected with confidence $R$ in the $j$'th iteration of the $T_{\Pi_{SIR}}$ operator, then the vertex is one day into recovery in the $(j+1)$'th iteration of the operator with the same confidence. The last rule says that if a vertex $V'$ was infected with confidence $P_{V'}$ and has not been removed with confidence $1 - R'$, and there is an edge from $V'$ to $V$ in the social network (weighted with $P_{V',V}$), given the confidence $1 - R$ that $V$ has not already been removed, then the confidence that the vertex $V$ gets infected is $(1 - R) \cdot P_{V',V} \cdot P_{V'} \cdot (1 - R')$. Here, $P_{V'} \cdot (1 - R')$ is one way of measuring the confidence that $V'$ is infected and has not recovered and $P_{V',V}$ is the confidence of infecting the neighbor. Notice that $e$ is a static property of the graph which does not change over time, so we do not need time indexes for it. As for $inf$, the reason why we can avoid using time indexes is that we can keep track of how much time has passed since a vertex got infected with the predicates $rec_i$ using the second rule.

To see how this GAP works, we execute a few iterations of the $T_{\Pi_{SIR}}$ operator and show the fixpoint that it reaches on the small graph shown in Figure 4. In this graph, the initial infected vertices are those shown as a shaded circle. The transmission probabilities weight the edges in the graph.

The SNDOP query $(SUM, \emptyset, k, inf, inf)$ can be used to compute the expected number of infected vertices in the least fixpoint of $T_\Pi$. This query says “find a set of at most $k$ vertices in the social network which, if infected, would cause the maximal number of vertices to become infected in the future.” However, the above set of rules can be easily used to express other things. For instance, an epidemiologist may not be satisfied with only one set of $k$ vertices that can cause the disease to spread to the maximum extent, as there may be another, disjoint set of $k$ vertices that could cause the same effect. The epidemiologist may want to find all members of the population that, if present in a group of size $k$, could spread the disease to a maximum extent. This can be answered using a SNDOP-ALL query, described in Section 3.

Fig. 4. Left: Sample network for disease spread (shaded vertices are infected; edges are bi-directional). Right: annotated atoms entailed after each application of $T_{\Pi_{SIR}}$ (maximum, non-zero annotations only).

Application   Annotated atoms entailed
1             inf(a):1, inf(c):1, inf(h):1
2             rec_1(a):1, rec_1(c):1, rec_1(h):1, inf(b):0.2, inf(d):0.3, inf(f):0.3, inf(g):0.05, inf(i):0.1
3             rec_2(a):1, rec_2(c):1, rec_2(h):1, rec_1(b):0.2, rec_1(d):0.3, rec_1(f):0.3, rec_1(g):0.05, rec_1(i):0.1, inf(g):0.08
4             rec_2(b):0.2, rec_2(d):0.3, rec_2(f):0.3, rec_2(g):0.05, rec_2(i):0.1, rec_1(g):0.08
5             rec_2(g):0.08

The  SIS  Model  of  Disease  Spread.  The  SIS  (Susceptible-Infectious-Susceptible) 
model  [Hethcote  1976]  is  a  variant  of  the  SIR  model  where  an  individual  becomes  sus¬ 
ceptible  to  disease  after  recovering  (as  opposed  to  SIR,  where  an  individual  acquires 
permanent  immunity).  SIS  can  be  easily  represented  by  a  simple  modification  to  the 
SIR  model. 

Diffusion Model 4.4 (SIS model). Take Diffusion Model 4.3 and change the third rule to

$$inf(V) : P_{V',V} \cdot P_{V'} \cdot (1 - R') \leftarrow e(V', V) : P_{V',V} \wedge inf(V') : P_{V'} \wedge rec_{t_{rec}}(V') : R'$$

Here, we do not consider the probability that vertex $V$ is immune; hence the probability of recovery does not change the probability of becoming infected.
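The only difference between the SIR and SIS infection rules is the factor $(1 - R)$ for the target vertex. The small Python sketch below evaluates both head annotations side by side; the function names and the numbers are illustrative assumptions, not values taken from Figure 4.

```python
def sir_infection_confidence(r_target, p_edge, p_source, r_source):
    """Head annotation of the third SIR rule: (1 - R) * P_{V',V} * P_{V'} * (1 - R')."""
    return (1.0 - r_target) * p_edge * p_source * (1.0 - r_source)

def sis_infection_confidence(p_edge, p_source, r_source):
    """Head annotation of the SIS rule: the (1 - R) factor for the target is dropped."""
    return p_edge * p_source * (1.0 - r_source)

# Hypothetical case: the source is infected with confidence 1, has not been removed,
# and the edge carries transmission probability 0.2.
fresh_target_sir = sir_infection_confidence(0.0, 0.2, 1.0, 0.0)
removed_target_sir = sir_infection_confidence(1.0, 0.2, 1.0, 0.0)
removed_target_sis = sis_infection_confidence(0.2, 1.0, 0.0)
```

A target that is already removed can no longer be infected under SIR, while under SIS it is infected with the same confidence as a fresh vertex.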

Diffusion  in  the  Flickr  Photo  Sharing  Network.  The  Flickr  social  network  al¬ 
lows  users  to  share  photographs.  Users  create  a  list  of  “favorite”  photos  that  can  be 
viewed  by  other  users.  [Cha  et  al.  2008]  use  a  variant  of  SIS  above  to  study  how  pho¬ 
tographs  spread  to  the  favorite  lists  of  different  users.  A  key  difference  is  that  they  do 
not  consider  a  node  “recovered”  -  i.e.  once  a  photo  was  placed  on  a  favorite  list,  it  was 
relatively  permanent  (the  study  was  conducted  over  about  100  days).  They  also  found 
that  photos  lower  on  a  favorite  list  (as  the  result  of  a  user  marking  a  large  number  of 
photos  as  “favorite”)  for  a  given  user  could  still  spread  through  the  network.  A  simple 
GAP  that  captures  the  intuition  of  how  Flickr  photos  spread  according  to  [Cha  et  al. 
2008]  uses  just  one  rule: 

Diffusion Model 4.5 (Flickr Photo Diffusion).

$$photo_i(V) : const_i \cdot X \leftarrow connected\_to(V', V) : 1 \wedge photo_i(V') : X$$

In Diffusion Model 4.5, the annotation of the vertex atom $photo_i(V)$ is the confidence that vertex $V$ has marked photo $i$ as one of its favorites. The predicate $connected\_to$ is the sole edge label representing that there is a connection from vertex $V'$ to $V$ (users select other users on this network). Additionally, the value $const_i$ is a number in $[0, 1]$ that determines how a given photo spreads in the network. Notice that the above rule is linear, as the head is a linear combination and $const_i \in [0, 1]$. We note that for all of these models, the annotation functions reflect one interpretation of the confidence that a vertex is infected or recovered; others are possible in our framework.
4.3. Homophilic Diffusion Models

Recently,  [Aral  et  al.  2009]  studied  the  spread  of  mobile  application  use  on  a  global 
instant-messaging  network  of  over  27  million  vertices.  They  found  that  network-based 
diffusion  could  overestimate  the  spread  of  a  mobile  application  and,  for  this  scenario, 
over  50%  of  the  adopted  use  of  the  applications  was  due  to  homophily  -  vertices 
with  similar  properties  adopting  similar  applications.  Further,  the  recent  experiment 
of  [Centola  2011]  illustrates  that  homophily  plays  a  role  in  enhancing  adoption  under 
the  tipping  model. 

These  results  should  not  be  surprising  -  the  basic  idea  behind  web-search  advertis¬ 
ing  is  that  two  users  with  a  similar  property  (search  term)  will  be  interested  in  the 
same  advertised  item.  In  fact,  [Cha  et  al.  2008]  explicitly  pre-processed  their  Flickr 
data  set  with  a  heuristic  to  eliminate  properties  attached  to  vertices  that  could  not  be 
accounted  for  by  a  diffusion  process.  We  can  easily  represent  homophilic  diffusion  in  a 
GAP  with  the  following  type  of  diffusion  rule: 

Diffusion Model 4.6 (Homophilic Diffusion of a Product).

$$buys\_product(V) : 0.5 \times X \leftarrow property(V) : 1 \wedge exposed\_to\_product(V) : X$$

In  Diffusion  Model  4.6,  if  a  vertex  is  exposed  to  a  product  (e.g.  through  mass  advertis¬ 
ing)  and  has  a  certain  property,  then  the  person  associated  with  the  vertex  purchases 
the  product  with  a  confidence  of  0.5  x  X,  where  X  measures  the  extent  of  the  exposure. 
For  this  rule,  there  are  no  network  effects. 

In [Watts and Peretti 2007], the authors propose a “big seed” marketing approach that combines both homophilic and network effects. They outline a strategy of advertising to a large group of individuals who are likely to spread the advertisement further through network effects. We now describe a GAP that captures the ideas underlying big seed marketing. Suppose we have a set of vertex predicate symbols $AL \subseteq VP$ corresponding to people's “attributes”; these may be certain demographic characteristics such as education level, race, level of physical fitness, etc. Suppose we want to advertise to people having (at least) one of $k \leq |AL|$ attributes to maximize an aggregate $agg$ with respect to a goal predicate $g$ (in other words, we want to choose $k$ attributes and advertise to people having those attributes so that $agg$ with respect to $g$ is maximized). Consider the following construction.

Diffusion Model 4.7 (Big Seed Marketing). The GAP includes an embedding of the social network as well as the network diffusion model of the user's choice. We make the following additions to the GAP and the SN:

(1) Add vertex predicate symbol $attrib$ to $VP$.

(2) For each $lbl \in AL$, add a vertex $v_{lbl}$ to $V$. We also set $\ell_{vert}(v_{lbl}) = \{attrib\}$.

(3) For each $lbl \in AL$, add the following non-ground rule:

$$g(V) : eff_{lbl} \times X \leftarrow lbl(V) : 1 \wedge g(v_{lbl}) : X$$

where $eff_{lbl}$ is a constant in $[0, 1]$ corresponding to the confidence that, if advertised to, a vertex $v$ with attribute $lbl$ obtains an annotation of 1 on $g(v)$.

Our SNDOP query is $(agg, VC, k, g(V), g(V))$, where $VC = \{attrib(V) : 1\}$.

Note that in the above diffusion model, the $v_{lbl}$ vertices correspond to advertisements directed toward different vertex properties. The $VC$ condition forces the query to only return $v_{lbl}$ vertices. As an example, a solution like $\{g(v_{lbl_1}), g(v_{lbl_2})\}$ means that we are targeting people having attribute $lbl_1$ or $lbl_2$. The diffusion rule, added per label, ensures that the mass advertisement is received and that the vertex acts accordingly (hence the $eff_{lbl}$ constants).
We close this section with a note that while all diffusion models mentioned here have
been  developed  by  others  and  have  been  shown  above  to  be  representable  via  GAPs, 
none  of  these  papers  has  developed  algorithms  to  solve  SNDOP  queries.  We  emphasize 
that  not  only  do  we  give  algorithms  to  answer  SNDOP  queries  in  the  next  section,  our 
algorithms  take  any  arbitrary  diffusion  model  that  can  be  expressed  as  a  GAP,  and  an 
objective  function  as  input.  In  addition,  our  notion  of  a  social  network  is  much  more 
general  than  that  of  many  existing  approaches. 

5.  ALGORITHMS 

In  this  section  we  study  how  to  solve  SNDOPs  algorithmically. 

5.1.  Naive  Algorithm 

The  naive  algorithm  for  solving  a  SNDOP  query  is  to  first  find  all  pre-answers  to 
the  query.  Then  compute  the  function  value  for  each  pre-answer  and  find  the  best. 
This  is  obviously  an  extremely  expensive  algorithm  that  is  unlikely  to  terminate  in  a 
reasonable  amount  of  time. 

An execution strategy that first finds all vertices in a social network $S$ that satisfy the vertex condition and then somehow restricts interest to those vertices in the above computation (where $S$ is embedded in a GAP $\Pi$) would not be correct for two reasons. First, $lfp(T_\Pi)$ assigns a truth value to each ground vertex atom $A$ that might be different from what is initially assigned within the social network. Second, when we add a new ground vertex atom $A$ to $\Pi$ (e.g. in our cell phone example, when we consider the possibility of assigning a free calling plan to a vertex $v$), it might be the case that vertices that previously did not satisfy the vertex condition $VC$ do so after the addition of $A$ to $\Pi$.

5.2.  A  Non-Ground  Algorithm  in  the  Monotonic  Case 

There are three major problems with the Naive algorithm. The first problem is that the aggregate function is very general and has no properties that we can take advantage of. Hence, we can show that the entire search space might need to be explored if an arbitrary aggregate function is used. The second problem is that it works on the “ground” instantiation of $\Pi$. The third problem is that the $T_\Pi$ operator maps all ground atoms to the $[0, 1]$ interval and there can be a very large number of ground atoms to consider. For instance, if we have a very small social network with just 1000 vertices and a rule with 3 variables in it, that rule has $1000^3 = 10^9$ possible ground instances, an enormous number. All these problems are further aggravated by the fact that fixpoints might have to be computed several times.

In this section, we provide an algorithm to compute answers to a SNDOP query under the assumption that our aggregate function is monotonic and under the assumption that all rules in a GAP have the form $A : f(\mu_1, \ldots, \mu_n) \leftarrow B_1 : \mu_1 \wedge \ldots \wedge B_n : \mu_n$, where each $\mu_i$ is a member of $[0, 1] \cup AVar$.

In this case, we define a non-ground interpretation and a non-ground fixpoint operator $S_\Pi$. This leverages existing work on non-ground logic programming initially pioneered by [M. Falaschi and Palamidessi 1988] and later adapted to different logic programming extensions by [Gottlob et al. 1996; Eiter et al. 1997; Stroe and Subrahmanian 2003]. We will use $\mathcal{A}^*$ to denote the set of all atoms (ground and non-ground). We start by defining a non-ground interpretation.

Definition 5.1. A non-ground interpretation is a partial mapping $NG : \mathcal{A}^* \rightarrow [0, 1]$. Every non-ground interpretation $NG$ represents an interpretation $grd(NG)$ defined as follows: $grd(NG)(A) = \max\{NG(A') \mid A \text{ is a ground instance of } A'\}$. When there is no atom $A'$ which has $A$ as a ground instance and for which $NG(A')$ is defined, then we set $grd(NG)(A) = 0$.

Thus, in a language with just three constants $a, b, c$ and one predicate symbol $p$, the non-ground interpretation that maps $p(X, a)$ to 0.5 and everything else to 0 corresponds to the interpretation that assigns 0.5 to $p(a, a)$, $p(b, a)$ and $p(c, a)$ and 0 to every other
ground  atom.  Non-ground  interpretations  are  succinct  representations  of  ordinary  in¬ 
terpretations  -  they  try  to  keep  track  only  of  assignments  to  non-ground  atoms  (not 
necessarily  all  ground  atoms)  and  they  do  not  need  to  worry  about  atoms  assigned  0. 
We  now  define  a  fixpoint  operator  that  maps  non-ground  interpretations  to  non-ground 
interpretations. 
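The mapping $grd$ can be sketched in a few lines of Python. The encoding is an assumption made for the illustration: an atom is a pair (predicate, argument tuple), arguments are strings, and a string starting with an upper-case letter is a variable.

```python
def is_variable(term):
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()

def is_instance(ground_args, pattern_args):
    """True if the ground argument tuple is an instance of the (possibly non-ground) pattern."""
    if len(ground_args) != len(pattern_args):
        return False
    binding = {}
    for g, p in zip(ground_args, pattern_args):
        if is_variable(p):
            # a repeated variable must be bound to the same constant every time
            if binding.setdefault(p, g) != g:
                return False
        elif p != g:
            return False
    return True

def grd(ng, ground_atom):
    """grd(NG)(A): maximum over all atoms A' in NG that have A as a ground instance, else 0."""
    pred, args = ground_atom
    values = [val for (p, pattern), val in ng.items()
              if p == pred and is_instance(args, pattern)]
    return max(values, default=0.0)

# The interpretation from the text: p(X, a) is mapped to 0.5, nothing else is defined.
ng = {("p", ("X", "a")): 0.5}
```

With this single entry, `grd(ng, ("p", ("b", "a")))` evaluates to 0.5 and `grd(ng, ("p", ("a", "b")))` to 0.0, so one non-ground atom stands in for three ground ones.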

Definition 5.2 (operator $S_\Pi$). The operator $S_\Pi$ associated with a GAP $\Pi$ maps a non-ground interpretation $NG$ to the non-ground interpretation $S_\Pi(NG)$ where $S_\Pi(NG)(A') = \max\{f(X_1, \ldots, X_n) \mid A : f(\mu_1, \ldots, \mu_n) \leftarrow B_1 : \mu_1 \wedge \ldots \wedge B_n : \mu_n$ is a rule in $\Pi$ and there exist atoms $(B'_1, \ldots, B'_n)$ such that $NG(B'_i)$ is defined for all $1 \leq i \leq n$, $(B_1, \ldots, B_n)$ and $(B'_1, \ldots, B'_n)$ are simultaneously unifiable via a most general unifier $\theta$, $A' = A\theta$, and (i) if $\mu_i$ is a constant, then $NG(B'_i) \geq \mu_i$ (in this case $X_i = \mu_i$), and (ii) if $\mu_i$ is a variable, then $X_i = NG(B'_i)\}$. (In this definition, without loss of generality, we assume the variables occurring in rules in $\Pi$ are mutually standardized apart and are also different from those in $NG$.)

The fixpoint operator $S_\Pi$ delays grounding to the maximal extent possible by (i) only looking at the rules in $\Pi$ directly rather than ground instances of rules in $\Pi$ (which is what $T_\Pi$ does), and (ii) by trying to assign values to non-ground atoms rather than ground instances, unless there is no other way around it. The following example shows how $S_\Pi$ works.

Example 5.3. For illustrative purposes, suppose we have the following GAP $\Pi$:

$$e(V, a) : 1$$

$$p(V) : 0.7$$

$$q(V') : X \leftarrow p(V') : X \wedge e(V', a) : 0.5$$

Let us apply $S_\Pi$ till we reach a fixed point. With the first application, we entail the (non-ground) annotated atoms $e(V, a) : 1$ and $p(V) : 0.7$ (we use the first and second facts). With the next application, $q(V') : 0.7$ is entailed. In fact, notice that the atoms in the body of the third rule, namely $p(V')$ and $e(V', a)$, can be unified with $p(V)$ and $e(V, a)$, for which the interpretation obtained in the first iteration is defined; moreover, the value of $e(V, a)$ in the interpretation is greater than 0.5. Notice also that in assigning a value to $q(V')$, the value of $p(V)$ in the current interpretation is used, that is, 0.7 is used in place of $X$. At this point the least fixed point is reached.
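The iteration in Example 5.3 can be reproduced with a small Python sketch of the non-ground operator. It is a simplified reading of Definition 5.2 for function-free atoms only; the encoding (atoms as (predicate, argument tuple), upper-case strings as variables, a string annotation as an annotation variable, a float as a constant) and all helper names are assumptions made for this illustration.

```python
from itertools import product

def is_var(t):
    return t[:1].isupper()

def walk(t, s):
    """Follow variable bindings in substitution s."""
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(a, b, s):
    """Extend s so that function-free atoms a and b agree; return None on failure."""
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        return None
    s = dict(s)
    for x, y in zip(a[1], b[1]):
        x, y = walk(x, s), walk(y, s)
        if x == y:
            continue
        if is_var(x):
            s[x] = y
        elif is_var(y):
            s[y] = x
        else:
            return None
    return s

def canonical(atom, s):
    """Apply s, then rename variables to V0, V1, ... so equal atoms get equal keys."""
    names, args = {}, []
    for t in atom[1]:
        t = walk(t, s)
        if is_var(t):
            t = names.setdefault(t, f"V{len(names)}")
        args.append(t)
    return (atom[0], tuple(args))

def s_operator(rules, ng):
    """One application of the non-ground operator to the interpretation ng."""
    out = {}
    for head, f, body in rules:
        for choice in product(list(ng.items()), repeat=len(body)):
            s, xs = {}, []
            for i, ((b_atom, mu), (n_atom, val)) in enumerate(zip(body, choice)):
                # standardize the interpretation's variables apart from the rule's
                renamed = (n_atom[0], tuple(f"{t}#{i}" if is_var(t) else t for t in n_atom[1]))
                s = unify(b_atom, renamed, s)
                if s is None or (not isinstance(mu, str) and val < mu):
                    s = None
                    break
                xs.append(val if isinstance(mu, str) else mu)
            if s is not None:
                key = canonical(head, s)
                out[key] = max(out.get(key, 0.0), f(xs))
    return out

# The GAP of Example 5.3; facts are rules with an empty body.
rules = [
    (("e", ("V", "a")), lambda xs: 1.0, []),
    (("p", ("V",)), lambda xs: 0.7, []),
    (("q", ("U",)), lambda xs: xs[0], [(("p", ("U",)), "X"), (("e", ("U", "a")), 0.5)]),
]
ng = {}
while True:
    nxt = s_operator(rules, ng)
    if nxt == ng:
        break
    ng = nxt
```

The loop stops after the third application with three non-ground entries, regardless of how many constants the language contains, which is the saving the operator is designed to provide.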

Consider the ordering $\preceq$ defined as follows on non-ground interpretations: $NG_1 \preceq NG_2$ iff $grd(NG_1) \leq grd(NG_2)$. In this case, it is easy to see that:

Proposition 5.4. Suppose $\Pi$ is any GAP. Then:

(1) $S_\Pi$ is monotonic.

(2) $S_\Pi$ has a least fixpoint $lfp(S_\Pi)$ and $lfp(T_\Pi) = grd(lfp(S_\Pi))$. That is, $lfp(S_\Pi)$ is a non-ground representation of the (ground) least fixpoint of the operator $T_\Pi$.

In short, $S_\Pi$ is a version of $T_\Pi$ that tries to work in a non-ground manner as much as possible. We now present the SNDOP-Mon algorithm to compute answers to a SNDOP query $(agg, VC, k, g_I(V), g_O(V))$ when $agg$ is monotonic. The SNDOP-Mon algorithm uses the following notation: $value(NG)$ is the same as $value(grd(NG))$ and $NG$ satisfies a formula iff $grd(NG)$ satisfies it.
SNDOP-Mon($\Pi$, $agg$, $VC$, $k$, $g_I(V)$, $g_O(V)$)

(1) The variable Curr is a tuple consisting of a GAP and a natural number. We initialize Curr.Prog = $\Pi$; Curr.Count = 0.

(2) Todo is a set of tuples described in step 1. We initialize Todo = {Curr}.

(3) Initialize the real number bestVal = 0 and GAP bestSOL = NIL.

(4) while Todo $\neq \emptyset$ do

    (a) Cand = first member of Todo

    (b) if $value(lfp(S_{Cand.Prog})) \geq bestVal \wedge lfp(S_{Cand.Prog}) \models VC$ then
        i. bestVal = $value(lfp(S_{Cand.Prog}))$; bestSOL = Cand

    (c) if Cand.Count < $k$ then

        i. For each ground atom $g_I(v)$ s.t. $\nexists$ OtherCand $\in$ Todo where
           OtherCand.Prog $\supseteq$ Cand.Prog,
           |OtherCand.Prog| $\leq$ |Cand.Prog| + 1,
           and $lfp(S_{OtherCand.Prog}) \models g_I(v) : 1$, do the following:

           A. Create new tuple NewCand.
              Set NewCand.Prog = Cand.Prog $\cup$ $\{g_I(v) : 1\}$.
              Set NewCand.Count = Cand.Count + 1.

           B. Insert NewCand into Todo.

        ii. Sort the elements Element $\in$ Todo in descending order of value(Element.Prog), where the first element, Top $\in$ Todo, has the greatest such value (i.e. there does not exist another element Top' s.t. value(Top'.Prog) > value(Top.Prog)).

    (d) Todo = Todo $-$ {Cand}

(5) if bestSOL $\neq$ NIL then return (bestSOL.Prog $- \Pi$) else return NIL.
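The steps of the listing can be transcribed into Python roughly as follows. This is a sketch under strong simplifying assumptions, not the paper's implementation: a candidate program is represented only by the set of seed vertices added to $\Pi$, `lfp(seeds)` is a caller-supplied function returning the set of vertices whose $g_I$ atom is entailed with annotation 1, and the toy diffusion used to exercise it (deterministic reachability with a count as aggregate) is hypothetical.

```python
from functools import lru_cache

def sndop_mon(vertices, k, lfp, value, satisfies_vc):
    todo = [frozenset()]                        # steps (1)-(2): start from the unmodified program
    best_val, best_sol = 0.0, None              # step (3); None plays the role of NIL
    while todo:                                 # step (4)
        cand = todo[0]                          # (a)
        fix = lfp(cand)
        if value(fix) >= best_val and satisfies_vc(fix):    # (b)
            best_val, best_sol = value(fix), cand
        if len(cand) < k:                       # (c)
            for v in vertices:
                # (c)i: skip v if a queued candidate extending cand by at most one
                # seed already entails g_I(v) : 1
                covered = any(other >= cand and len(other) <= len(cand) + 1 and v in lfp(other)
                              for other in todo)
                if not covered:
                    todo.append(cand | {v})
            todo.sort(key=lambda c: value(lfp(c)), reverse=True)   # (c)ii
        todo.remove(cand)                       # (d)
    return best_sol                             # step (5): the added seeds


# Hypothetical monotone diffusion: a seed infects everything reachable from it.
edges = {"a": ["b"], "b": ["c"], "c": [], "d": ["e"], "e": []}

@lru_cache(maxsize=None)
def reach(seeds):
    seen, stack = set(seeds), list(seeds)
    while stack:
        for w in edges[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return frozenset(seen)

result = sndop_mon(sorted(edges), 2, reach, len, lambda fix: True)
```

On this toy graph the search never expands `b`, `c` or `e` as seeds, because each is already entailed by a cheaper candidate, and it returns the pair `{a, d}` that infects all five vertices.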




The  following  example  shows  how  the  SNDOP-Mon  algorithm  works. 

Example 5.5. Consider Example 3.16, where we present a social network and some diffusion rules for disease spread embedded in program $\Pi_{disease}$. Suppose we want to answer a SNDOP query $(\Pi_{disease}, SUM, true, 2, inf(V), inf(V))$. The search tree in Figure 5 illustrates how SNDOP-Mon searches for an optimal solution to the query. In the figure, we labeled each node with the set of vertices and a real number. The vertices correspond to the vertex atoms (annotated with 1) formed with $inf$ added to the GAP in step 4(c)i. The real number corresponds to the value resulting from this addition. Underlined nodes in the search tree represent potential solutions where bestVal and bestSOL are updated. Notice that, for example, the set $\{v_4, v_1\}$ is never considered. This is because $inf(v_1)$ is entailed anytime a candidate solution includes $v_4$. The optimal solution is found to be $\{v_7, v_5\}$. In this example, the algorithm considers solutions in the following order: $\{\}$, $\{v_4\}$, $\{v_4, v_7\}$, $\{v_4, v_5\}$, $\{v_4, v_6\}$, $\{v_7\}$, $\{v_7, v_5\}$, $\{v_7, v_1\}$, $\{v_7, v_2\}$, $\{v_7, v_3\}$, $\{v_5\}$, $\{v_5, v_6\}$, $\{v_5, v_1\}$, $\{v_5, v_2\}$, $\{v_5, v_3\}$, $\{v_6\}$, $\{v_6, v_1\}$, $\{v_6, v_2\}$, $\{v_6, v_3\}$, $\{v_1\}$, $\{v_2\}$, $\{v_3\}$.


The  following  result  states  that  the  SNDOP-Mon  algorithm  is  correct. 

THEOREM 5.6. Given a SNDOP query Q = (agg, VC, k, g_I(V), g_O(V)) and a GAP Π
embedding a social network S, if agg is monotonic then:

• There is an answer to the SNDOP query Q w.r.t. Π iff
  SNDOP-Mon(Π, agg, VC, k, g_I(V), g_O(V)) does not return NIL.

• If SNDOP-Mon(Π, agg, VC, k, g_I(V), g_O(V)) returns any result other than NIL, then
  that result is an answer to the SNDOP query Q w.r.t. Π.

5.3.  Approximation  Algorithms:  GREEDY-SNDOP 

Even though SNDOP-Mon offers advantages such as pruning of the search tree and
leverages non-ground operations to increase efficiency over the naive algorithm, it is
still intractable in the worst case. Regretfully, unless P = NP, Theorem 3.17 precludes an
exact solution in PTIME and Theorem 3.20 precludes a PTIME α-approximation algorithm
where α < e/(e−1). Both of these results hold for the restricted case of linear-GAPs and
positive-linear aggregate functions.

The good news is that we were able to show the following: (i) for linear-GAPs and a-priori
VC queries with positive-linear aggregates, the value function is submodular (Theorem 3.15);
(ii) under these conditions, we can reduce the problem to the maximization
of a submodular function over a uniform matroid (the uniformity of the matroid
is proved in Lemma 3.14 for a-priori VC queries); (iii) we can leverage the work
of [Nemhauser et al. 1978], which admits a greedy e/(e−1) approximation algorithm. In this
section, we develop a greedy algorithm for SNDOP queries that leverages the above
three results.

The GREEDY-SNDOP algorithm shown below assumes a linear GAP, an a-priori VC
query with positive-linear aggregates, and a zero-starting value function (notice that
the latter requirement can be met as stated by Proposition 3.11). The algorithm provides
an e/(e−1) approximation to the SNDOP query problem. As this matches the upper
bound of Theorem 3.20, we cannot do better in terms of an approximation guarantee.


GREEDY-SNDOP(Π, agg, VC, k, g_I(V), g_O(V)) returns SOL ⊆ V

(1) Initialize SOL = ∅ and
    REM = {v ∈ V | ({g_I(v) : 1} ∪ ⋃_{pred ∈ ℓ_vert(v)} {pred(v) : 1}) ⊨ VC[V/v]}

(2) While |SOL| < k and REM ≠ ∅

    (a) v_best = null, val = value(SOL), inc = 0

    (b) For each v ∈ REM, do the following

        i. Let inc_new = value(SOL ∪ {v}) − val

        ii. If inc_new ≥ inc then inc = inc_new and v_best = v

    (c) SOL = SOL ∪ {v_best}, REM = REM − {v_best}

(3) Return SOL
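
The greedy loop above can be sketched in Python as follows. This is a minimal sketch, not the authors' Java implementation: it assumes the set REM has already been filtered by the vertex condition and is passed in as `candidates`, and that `value` is a hypothetical callable returning the aggregate for a set of seed vertices. A toy coverage function, which is zero-starting and submodular, stands in for the GAP fixed-point computation.

```python
def greedy_sndop(candidates, k, value):
    """Greedy selection of at most k vertices.

    `candidates` plays the role of REM after the vertex condition has been applied;
    `value` is a hypothetical callable mapping a set of vertices to a real number.
    """
    sol, rem = set(), set(candidates)
    while len(sol) < k and rem:
        val = value(sol)
        v_best, inc = None, 0.0
        for v in sorted(rem):                 # sorted only to make tie-breaking deterministic
            inc_new = value(sol | {v}) - val
            if v_best is None or inc_new > inc:
                v_best, inc = v, inc_new
        sol.add(v_best)
        rem.discard(v_best)
    return sol


# Toy submodular value function (an assumption): number of items covered by the selection.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(selected):
    return float(len(set().union(*(COVERS[v] for v in selected))))


print(sorted(greedy_sndop(COVERS, 2, coverage)))
```

Each pass of the outer loop calls `value` once per remaining candidate, which is where the k · |V| factor in the running time comes from.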


We  now  analyze  the  time  complexity  of  GREEDY-SNDOP. 

PROPOSITION 5.7. Given a SNDOP query Q = (agg, VC, k, g_I(V), g_O(V)), a social
network S, and a GAP Π ⊇ Π_S, the complexity of GREEDY-SNDOP is O(k · |V| · F(|V|)),
where F(|V|) is the time complexity to compute value(V′) for some set V′ ⊆ V of size k.
Table V. First iteration of the greedy algorithm

Atom \ Vertex        v1      v2      v3      v5      v7      v9      v10
buys_camera(v1)      1.0     0.5     0.0     0.5     0.0     0.0     0.0
buys_camera(v2)      0.0     1.0     0.0     0.0     0.0     0.0     0.0
buys_camera(v3)      0.0     1.0     1.0     0.0     0.0     0.0     0.0
buys_camera(v4)      0.0     0.0     0.0     0.0     0.0     0.0     0.0
buys_camera(v5)      0.0     0.0     0.0     1.0     0.0     0.0     0.0
buys_camera(v6)      0.0     0.0     0.0     0.0     0.0     0.0     0.0
buys_camera(v7)      0.0     0.25    0.25    0.0     1.0     0.0     0.0
buys_camera(v8)      0.0     0.5     0.5     0.0     0.0     0.0     0.0
buys_camera(v9)      0.33    0.5     0.33    0.17    0.0     1.0     0.33
buys_camera(v10)     0.0     0.5     0.5     0.0     0.0     0.0     1.0
SUM                  1.33    4.25    2.58    1.67    1.0     1.0     1.33

Table VI. Incremental Increases for Both Iterations of GREEDY-SNDOP

Vertex    Incremental Increase    Incremental Increase
          on First Iteration      on Second Iteration
v1        1.33                    0.67
v2        4.25                    NA
v3        2.58                    0.0
v5        1.67                    1.67
v7        1.0                     0.75
v9        1.0                     0.5
v10       1.33                    0.67

We  note  that  most  likely,  the  most  expensive  operation  is  the  computation  of  value 
at  line  2(b)i.  One  obvious  way  to  address  this  issue  is  by  using  a  non-ground  version  of 
the  fixed-point. 

THEOREM 5.8. Given a SNDOP query Q = (agg, VC, k, g_I(V), g_O(V)), a social network
S, and a GAP Π ⊇ Π_S, if

—  II  is  a  linear  GAP, 

—  Q  is  a-priori  VC, 

—  agg  is  positive-linear,  and 

—  value  is  zero-starting, 

then GREEDY-SNDOP is an e/(e−1)-approximation algorithm.

Example  5.9.  Consider  Example  4.1  and  program  II(m  from  page  19.  Assume  we 
have  an  additional  vertex  predicate  symbol  pro  assigned  to  professional  photographers 
(who  are  depicted  with  shaded  vertices  in  Figure  3).  Consider  the  SNDOP  query  where 
agg = SUM, VC = {pro(V)}, k = 2, g_I(V) = buys_camera(V), g_O(V) = buys_camera(V).

On the first iteration of GREEDY-SNDOP, the algorithm computes the value for all
vertices in the set REM, which are v1, v2, v3, v5, v7, v9, v10. The resulting annotations of
the fixed points and aggregates are shown in Table V.

As value(∅) = 0, the incremental increase afforded by v2 is 4.25, which is clearly the
greatest of all the vertices considered. GREEDY-SNDOP adds v2 to SOL, removes it
from REM and proceeds to the next iteration. Table VI shows the incremental increases
for the second iteration. As v5 provides the greatest increase, it is picked, and
the resulting solution is {v2, v5}.
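
The two greedy choices in this example can be replayed with a few lines of Python. The numbers below are copied from the SUM row of Table V and from the second-iteration column of Table VI; the helper name pick_best is an assumption made for illustration, and the final total is simple arithmetic on those table entries rather than a figure reported in the text.

```python
# value({v}) for each candidate vertex v in REM (SUM row of Table V).
first_iteration = {"v1": 1.33, "v2": 4.25, "v3": 2.58, "v5": 1.67,
                   "v7": 1.0, "v9": 1.0, "v10": 1.33}

# Incremental increases on the second iteration (Table VI); v2 is already in SOL.
second_iteration = {"v1": 0.67, "v3": 0.0, "v5": 1.67,
                    "v7": 0.75, "v9": 0.5, "v10": 0.67}


def pick_best(increments):
    """Return the vertex with the largest incremental increase."""
    return max(increments, key=increments.get)


first_pick = pick_best(first_iteration)
second_pick = pick_best(second_iteration)
solution = [first_pick, second_pick]

# Value of the final solution: value({v2}) plus the increase contributed by v5.
total_value = round(first_iteration[first_pick] + second_iteration[second_pick], 2)
print(solution, total_value)
```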

6.  IMPLEMENTATION  AND  EXPERIMENTS 

We have implemented the GREEDY-SNDOP algorithm in 660 lines of Java code by
re-using and extending the diffusion modeling Java library of [Broecheler et al. 2010]
(approximately 35K lines of code). Our implementation uses multiple threads in the inner
loop of the GREEDY-SNDOP algorithm to increase efficiency. All experiments were
executed on the same machine with a dedicated 4-core 2.4GHz processor and 22GB
of main memory. Times were measured to millisecond precision and are reported in
seconds.

6.1.  Experimental  Setting 

Data set. In order to evaluate GREEDY-SNDOP, we used a real-world dataset based
on a social network of Wikipedia administrators and authors. Wikipedia is an online
encyclopedia collaboratively edited by many contributors from all over the world.
Selected contributors are given privileged administrative access rights to help maintain
and control the collection of articles with additional technical features. A vote by
existing administrators and ordinary authors determines whether an individual is
granted administrative privileges. These votes are publicly recorded. [Leskovec et al.
2010] crawled 2794 elections from the inception of Wikipedia until January 2008.
The votes cast in these elections give rise to a social network among Wikipedia
administrators and authors by representing a vote of user i for user j as a directed
edge from node i to j. In total, the dataset contains 103,663 votes (edges) connecting
more than 7000 Wikipedia users (vertices). Hence, the network is large and densely
connected.¹¹

SNDOP-Query.  In  our  experiments,  we  consider  the  hypothetical  problem  of  finding 
a  set  of  administrators  having  the  highest  overall  influence  in  the  Wikipedia  social 
network  described  above.  We  treat  votes  as  a  proxy  for  the  inverse  of  influence.  In  other 
words,  if  user  i  voted  for  user  j,  we  assume  user  j  (intentionally  through  lobbying  or 
unintentionally  through  the  force  of  his  contributions  to  Wikipedia)  influenced  user  i 
to  vote  for  him.  All  edges  are  assigned  a  weight  of  1.  Our  SNDOP  queries  are  designed 
as  per  the  following  definition. 

Definition 6.1 (Wikipedia SNDOP-Query). Given some natural number k > 1, a
Wikipedia SNDOP query, WQ(k), is specified as follows:

— agg = SUM: the intuition is that the aggregate provides us with an expected number of
vertices that are influenced.

— VC = ∅: we do not use a vertex condition in our experiments.

— k is as specified by the input.

— g_I(V) = g_O(V) = influenced(V).

Diffusion  Models  Used.  We  represented  the  diffusion  process  with  two  different 
models:  one  tipping  and  one  cascading. 

— Cascading diffusion model. We used the Flickr Diffusion Model (Diffusion
Model 4.5 on page 22) described in Section 4.2. In this model, a constant parameter
α represents the “strength” or “likelihood” of influence. The larger the parameter α,
the higher the influence of a user on those who voted for her.

— Tipping diffusion model. [Cha et al. 2009] shows that there is a relationship between
the likelihood of a vertex marking a photo as a favorite and the percentage
of their neighbors that also marked that photo as a favorite. This implies a tipping
model (as in Section 4.1). We apply the Jackson-Yariv model (i.e. Diffusion Model 4.2)
with B equated to influenced. For each vertex v_j ∈ V, we set the benefit to cost ratio
(b_j / c_j) to 1. Finally, the function γ defined in the Jackson-Yariv model is the
constant-valued function (for all values of x):

γ(x) = α.

This says that irrespective of the number of neighbors that a vertex has, the benefit
to adopting strategy B (i.e. influenced) is α. Therefore, the resulting diffusion rule
for the linear Jackson-Yariv model is:

influenced(v_i) : α · (Σ_j X_j) / |{v_j | (v_j, v_i) ∈ E}| ← ⋀_{v_j | (v_j, v_i) ∈ E} influenced(v_j) : X_j

Fig. 6. Runtimes of GREEDY-SNDOP for different values of α and k = 5 in both diffusion models
(plot title: Runtimes for different α values)

11 Our Wikipedia data set does not include edge weights. However, including edge weights should not
appreciably change the experimental results, which show that solving SNDOP queries when tipping
models are used is faster, in general, than when cascade models are used.

For both models, we derive a unique logic program for each setting of the parameter
α. The parameter α depends on the application and can be learned from ground truth
data. In our experiments, we varied α to avoid introducing bias.


6.2.  Experimental  Results 

Run-time of GREEDY-SNDOP with varying α and different diffusion models.

Figure 6 shows the total runtime of GREEDY-SNDOP in seconds to find the set of k = 5
most influential users in the Wikipedia voting network for different values of the
strength of influence parameter α. We varied α from 0.05 (very low level of influence)
to 0.5 (very high level of influence) for both the cascading and tipping diffusion models.
We observe that higher values of α lead to higher runtimes, as expected, since the scope
of influence of any individual in the network is larger. Furthermore, we observe that
the runtimes for the tipping diffusion model increase more slowly with α compared to
the cascading model.


Run-time of GREEDY-SNDOP with varying k. For the next set of experiments,
we kept the strength of influence fixed at α = 0.2 and varied k, which governs the
size of the set of influencers. Figure 7 reports the runtime of GREEDY-SNDOP for
the query WQ(k) with k = 5, 10, 15, 20, 25. For the cascading model, the runtime
is approximately linear in k; a curve-fitting analysis using Excel showed a slight
superlinear trend (even though the figure itself looks linear at first sight). Figure 8
shows the time taken to execute each of the 25 iterations of the outer loop for the
query WQ(25) with α = 0.2. Note that each subsequent iteration is more expensive
than the previous one since the size of the logic programs to consider increases with
the addition of each ground atom influenced(v_i). However, we also implemented the
practical improvement of “lazy evaluation” of the submodular function as described
in [Leskovec et al. 2007b]. This improvement, which maintains correctness of the
algorithm, stores previous improvements in total score and prunes the greedy search
for the highest scoring vertex as discussed. We found that this technique also reduced
the runtime of subsequent iterations.

Fig. 7. Runtimes of GREEDY-SNDOP for different values of k and α = 0.2 in both diffusion models
(plot title: Time to find Individuals; x-axis: Number of computed individuals)

Fig. 8. Time per iteration of GREEDY-SNDOP for α = 0.2 in both diffusion models
(plot title: Time per Individual; x-axis: Index of Individual)
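
The idea behind lazy evaluation can be sketched in Python as follows. This is a generic sketch of the technique under stated assumptions, not the code used in the experiments: `value` is a hypothetical callable, and the toy coverage function stands in for the fixed-point computation. Because the value function is submodular, a gain computed in an earlier iteration is an upper bound on the current gain, so a candidate needs to be re-evaluated only when its stale bound reaches the top of the priority queue.

```python
import heapq


def lazy_greedy(candidates, k, value):
    """Greedy selection with lazy re-evaluation of marginal gains.

    Returns the chosen set and the number of calls made to `value`.
    """
    sol = set()
    current = value(sol)
    calls = 1
    heap = []
    for v in candidates:
        # Entry: (negated gain, vertex, size of sol when the gain was computed).
        heap.append((-(value({v}) - current), v, 0))
        calls += 1
    heapq.heapify(heap)
    while len(sol) < k and heap:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(sol):          # gain is up to date, so v is the greedy choice
            sol.add(v)
            current -= neg_gain
        else:                          # stale bound: recompute and push back
            gain = value(sol | {v}) - current
            calls += 1
            heapq.heappush(heap, (-gain, v, len(sol)))
    return sol, calls


# Toy submodular value function (an assumption): number of items covered by the selection.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(selected):
    return float(len(set().union(*(COVERS[v] for v in selected))))


chosen, num_calls = lazy_greedy(COVERS, 2, coverage)
print(sorted(chosen), num_calls)
```

In this toy run the second iteration re-evaluates only one candidate, whereas the plain greedy loop would re-evaluate all three remaining candidates.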

Our experimental results show that we can answer SNDOP queries on large social
networks. For example, computing the set of five most influential Wikipedia users in
the voting network required approximately 2 hours averaged over the different values
of α in the tipping diffusion model.
7. RELATED WORK

There has been extensive work in reasoning about diffusion in social networks. However,
to our knowledge, there is no work on the relationship between logic programming
and social networks. Moreover, there is no general framework to solve social network
diffusion optimization problems that can take a broad class of diffusion models
as input. We believe this work represents the first deterministic framework for representing
generalized diffusion models that allows for different properties and weights
on vertices and edges. Previously, the authors presented the framework of SNDOPs
in [Shakarian et al. 2010]. However, this brief technical communication did not include
our exact or approximate algorithms, an implementation, experiments,
the SNDOP-ALL problem, many of the complexity results, or many of the constructions
seen in this paper (such as the homophilic diffusion models and big-seed marketing).

7.1.  Related  Work  in  Logic  Programming 

We first compare our work with annotated logic programming [Kifer and Subrahmanian
1992; Kifer and Lozinskii 1992; Thirunarayan and Kifer 1993] and its many extensions
and variants [Vennekens et al. 2004; Krajci et al. 2004; Lu 1996; Lu et al.
1993; Damasio et al. 1999; Kern-Isberner and Lukasiewicz 2004; Lukasiewicz 1998].
There has been much work on annotated logic programming and we have built on
the syntax and semantics of annotated LP. [Swift 1999] describes how lattice answer
subsumption can implement GAPs, whereas [Swift and Warren 2010] describes its implementation
(as well as the implementation of partial order answer subsumption)
in XSB and analyzes its performance, showing scalability for applications in social
network analysis, abstract interpretation, and query justification. We also note that
possibilistic logic [Dubois and Prade 1990; Dubois et al. 1991] might be extended to
handle the types of calculations GAPs support. In fact, GAPs may be viewed as such
an extension of possibilistic logic. However, we are not aware of any work on solving
optimization queries (queries that seek to optimize an aggregate function) w.r.t. annotated
logic programming.

[Raedt et al. 2007] proposes a probabilistic version of Prolog, called ProbLog. A
ProbLog program consists of a set of definite clauses (as in Prolog) where each clause
is associated with a probability. Given a ProbLog program T, the authors induce a
probability distribution over the space of definite logic programs L ⊆ L_T, where L_T is
the definite logic program obtained from T by stripping out the probabilities. The probability
of L ⊆ L_T is obtained as ∏_{c_i ∈ L} p_i × ∏_{c_i ∈ L_T − L} (1 − p_i), where p_i is the probability
of clause c_i. The probability that a query succeeds is the sum of the probabilities of the
Prolog programs L ⊆ L_T where the query succeeds. The semantics of ProbLog is called
the distribution semantics; it has been borrowed from PRISM [Sato 1995]. Basically,
in the distribution semantics all facts are assumed to be mutually independent [Hommersom
and Lucas 2011]. Similar assumptions are made in certain other logics such
as Independent Choice Logic [Poole 2008] and PRISM [Sato 1995; Sneyers et al. 2010].
Ng and Subrahmanian [Ng and Subrahmanian 1992; 1993] propose probabilistic logic
programs where the independence assumption is not required, but this is computationally
expensive, though recent approaches [Khuller et al. 2007] based on sampling
have been shown to scale very well to the case of 100K atoms. In order to compute the
success probability of a query, [Raedt et al. 2007] first builds a monotone DNF formula
(this represents the proofs of the query when probabilities are ignored), and then uses
a BDD based approach to compute the probability. The approach is experimentally
evaluated on biological networks showing good scalability.
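
The distribution semantics described above can be illustrated by brute-force enumeration. The following Python sketch is only an illustration of the definition, not ProbLog's BDD-based procedure: the function success_probability, the clause names, and the probabilities are all assumptions made for the example. Each subprogram L is weighted by the product of p_i over the clauses it keeps and of (1 − p_i) over the clauses it drops, and the weights of the subprograms in which the query succeeds are summed.

```python
from itertools import product


def success_probability(clause_probs, succeeds):
    """Sum the probabilities of all subprograms in which the query succeeds.

    clause_probs maps a clause name to its probability; `succeeds` is a
    hypothetical callable deciding the query for a given set of kept clauses.
    """
    names = list(clause_probs)
    total = 0.0
    for keep in product([True, False], repeat=len(names)):
        kept = {n for n, flag in zip(names, keep) if flag}
        weight = 1.0
        for n, flag in zip(names, keep):
            weight *= clause_probs[n] if flag else 1.0 - clause_probs[n]
        if succeeds(kept):
            total += weight
    return total


# Illustrative probabilistic facts (assumptions): three uncertain edges.
EDGE_PROBS = {"edge(a,b)": 0.8, "edge(b,c)": 0.5, "edge(a,c)": 0.3}


def path_a_to_c(kept):
    """Query: is c reachable from a using only the kept edges?"""
    return "edge(a,c)" in kept or {"edge(a,b)", "edge(b,c)"} <= kept


p_path = success_probability(EDGE_PROBS, path_a_to_c)
print(round(p_path, 4))
```

The enumeration is exponential in the number of clauses, which is why practical systems rely on compact representations instead.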

The independence assumption is frequently made in many applications. However,
in social networks, assuming independence of node properties and/or diffusions can be
dangerous because the diffusive process is explicitly one of dependency: the probability
of vertex A being infected by a neighbor is directly dependent on the probabilities
of one or more of its neighbors being infected. We note that [Raedt et al. 2007] does not
provide any results on solving social network optimization problems.

In many logics that incorporate independence assumptions, including [Raedt et al.
2007; Sneyers et al. 2010; Poole 2008], the probability of diffusion from the neighbors of
a vertex to that vertex is computed via the independence assumption. In the simplest
case, consider a vertex v and two vertices a, b such that (a, v), (b, v) are edges in the
graph, and suppose there are no other edges of the form (·, v). Suppose we know that
the probability of v being infected by a (resp. b) is 0.7 (resp. 0.5). In this case, the
probability of infection of v under the assumption of independence is 0.7 + 0.5 − 0.7 × 0.5 =
0.85. The reason independence is important here is that P(E ∨ E′) = P(E) + P(E′) −
P(E ∧ E′), and P(E ∧ E′) = P(E) × P(E′) only when the events E, E′ are mutually
independent.
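
The arithmetic in this example can be checked with a short Python sketch. The function names are assumptions made for illustration. The first function computes the probability of a disjunction of independent events, and the second applies the general identity P(E ∨ E′) = P(E) + P(E′) − P(E ∧ E′), which needs the joint probability as an extra input when independence does not hold.

```python
def prob_or_independent(probs):
    """Probability that at least one of several mutually independent events occurs."""
    p_none = 1.0
    for p in probs:
        p_none *= 1.0 - p
    return 1.0 - p_none


def prob_or_two(p_e, p_f, p_both):
    """General inclusion-exclusion for two events with a known joint probability."""
    return p_e + p_f - p_both


# Infection of v by a (0.7) or by b (0.5), assuming independence.
p_independent = prob_or_independent([0.7, 0.5])

# Same marginals, but assuming (for illustration) that b infects v only when a does,
# so that P(E and E') = 0.5 instead of 0.35.
p_dependent = prob_or_two(0.7, 0.5, 0.5)

print(round(p_independent, 2), round(p_dependent, 2))
```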

When independence is not assumed between E and E′, one must compute P(E ∧ E′)
by either solving a linear program or via some other method. Dekhtyar et al. [Dekhtyar
et al. 1999; Dekhtyar and Subrahmanian 2000] developed methods to not only
find such probabilities when the independence assumption cannot be made, but also
suggested how different assumptions on the relationships between events (e.g. positive
correlation, negative correlation, mutual exclusion and independence) could be
computed via hybrid logic programs. For example, if we assume that an arbitrary triangular
co-norm¹² [Bonissone 1987b] ⊕ is used to compute the (disjunctive) probability
that vertex v is infected by either a or by b, then we can express the diffusion via the
GAP rule:

inf(v) : V1 ⊕ V2 ← inf(a) : V1 ∧ inf(b) : V2.

Thus we see that such rules can capture triangular co-norms (including that used to
compute the probability of a disjunct under the independence assumption).

However, though GAPs can be more expressive than many languages such as [Raedt
et al. 2007; Sneyers et al. 2010; Poole 2008], there is no guarantee that they will be
more “efficient” or more “intuitive” when additional assumptions such as independence
are made. For instance, suppose we consider a more complex situation where
we have three vertices a, b, c and the same vertex v above. Suppose also that we have edges
(a, v), (b, v), (c, v) in our graph and we want to say that infection propagation is independent.
In this case, we can express this as a GAP by writing the rules shown below:

inf(v) : V ← edge(X, v) : 1 ∧ inf(X) : V.

inf(v) : V1 ⊕ V2 ← edge(X, v) : 1 ∧ edge(Y, v) : 1 ∧
                   inf(X) : V1 ∧ inf(Y) : V2 ∧ X ≠ Y.

inf(v) : V1 ⊕ V2 ⊕ V3 ← edge(X, v) : 1 ∧ edge(Y, v) : 1 ∧ edge(Z, v) : 1 ∧
                        inf(X) : V1 ∧ inf(Y) : V2 ∧ inf(Z) : V3 ∧
                        X ≠ Y ∧ Y ≠ Z ∧ X ≠ Z.


12 A triangular co-norm is a function ⊕ : [0, 1] × [0, 1] → [0, 1] that gives the probability of the “or”
of two events whose probabilities are known and are provided as input to ⊕. All triangular co-norms satisfy
the following axioms. (Ax1) ⊕ is associative and commutative. (Ax2) x ⊕ 0 = x. (Ax3) ⊕ is monotone, i.e. if
x ≤ x′ and y ≤ y′ then x ⊕ y ≤ x′ ⊕ y′. This axiom says that when the probabilities of both events go up
(or stay the same), the probability of the “or” can only go up (or stay the same). Triangular co-norms have
been extensively studied in logic programming as far back as 1988 [Subrahmanian 1988]; max(x, y) and
x + y − x · y are two well-known triangular co-norms, and there are many others.
When x ⊕ y = x + y − x · y, the above rules correspond to propagation under an
independence assumption; otherwise ⊕ can be any triangular co-norm.

In this case, we wrote three rules, one for each possible number of infected predecessors
of v. The first rule covers the case where we want the infection to pass from
exactly one predecessor to v. The second rule covers all possible combinations of two
predecessors of v passing the infection on. The third rule looks at the case where all
three predecessors pass the infection along. Though this GAP is seemingly larger and
more cumbersome than the simple conditional probability statement governing infections
discussed above, we can show that the third rule is the only one that is
needed, and this holds for any triangular co-norm. In fact, the semantics of the above
GAP is identical to the semantics of the GAP with just the last rule (plus the
social network). Intuitively, the reason is that the probability of v getting the infection
from all three is always greater than or equal to the probability of v getting it from just
one or just two of its predecessors.

We now generalize the observations above. When we apply a triangular co-norm ⊕ to
a set {x1, ..., xk}, we use ⊕({x1, ..., xk}) as short-hand for ⊕(x1, ⊕({x2, ..., xk})), which
is well defined as all triangular co-norms are commutative and associative.

The following proposition says that a triangular co-norm ⊕ applied to a set S′ always
gives a value no smaller than the value obtained by applying ⊕ to a subset of S′.

PROPOSITION 7.1. Let S and S′ be two sets of elements in [0, 1], and ⊕ a triangular
co-norm. If S ⊆ S′, then ⊕S ≤ ⊕S′.

PROOF. Suppose S ⊆ S′. We show that ⊕S ≤ ⊕(S ∪ ΔS) for any ΔS ⊆ S′ − S. This
is shown by induction on the cardinality i of ΔS.

Base case i = 0. Straightforward.

Inductive step. Suppose ⊕S ≤ ⊕(S ∪ ΔS_i) for any ΔS_i ⊆ S′ − S such that |ΔS_i| = i.
Let ΔS_{i+1} ⊆ S′ − S be such that |ΔS_{i+1}| = i + 1. Consider a set ΔS_i s.t. ΔS_i ⊆ ΔS_{i+1}
and |ΔS_i| = i, and let e_i be the element in ΔS_{i+1} but not in ΔS_i. Ax2 implies that
⊕(S ∪ ΔS_i) = (⊕(S ∪ ΔS_i)) ⊕ 0. By the associative and commutative properties of ⊕ we
have ⊕(S ∪ ΔS_{i+1}) = (⊕(S ∪ ΔS_i)) ⊕ e_i. As 0 ≤ e_i, then ⊕(S ∪ ΔS_i) ≤ ⊕(S ∪ ΔS_{i+1})
by the monotonicity property. As ⊕S ≤ ⊕(S ∪ ΔS_i) (by the induction hypothesis), then
⊕S ≤ ⊕(S ∪ ΔS_{i+1}). □
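
Proposition 7.1 can be spot-checked numerically. The Python sketch below is an illustration on a handful of values, not a proof: it folds a binary operator over a collection starting from 0 (the identity by Ax2) and compares every sub-collection against every collection that contains it. The function names and the sample values are assumptions made for the example. The last operator, |x − y|, satisfies Ax2 but is not monotone, so it is included to show that the check can fail for an operator that is not a triangular co-norm.

```python
from functools import reduce
from itertools import combinations


def fold_conorm(op, values):
    """Apply a binary operator to a collection of values, starting from 0."""
    return reduce(op, values, 0.0)


def prob_sum(x, y):
    """Probabilistic sum, the co-norm that corresponds to independence."""
    return x + y - x * y


def subsets(values):
    """All sub-collections of `values`, chosen by position."""
    for r in range(len(values) + 1):
        yield from combinations(values, r)


def check_proposition(op, values, tol=1e-12):
    """True iff fold(S) <= fold(S') for every S contained in S' contained in `values`."""
    return all(
        fold_conorm(op, s) <= fold_conorm(op, s_prime) + tol
        for s_prime in subsets(values)
        for s in subsets(s_prime)
    )


sample = (0.2, 0.7, 0.5, 0.0)
print(check_proposition(max, sample))
print(check_proposition(prob_sum, sample))
print(check_proposition(lambda x, y: abs(x - y), (0.7, 0.5)))
```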

Thus, in order to deal with the infection scenario discussed above it suffices to write
rules of the form:

inf(V) : V1 ⊕ ··· ⊕ Vn ← edge(X1, V) : 1 ∧ ··· ∧ edge(Xn, V) : 1 ∧
                          inf(X1) : V1 ∧ ··· ∧ inf(Xn) : Vn ∧
                          ⋀_{1≤i<j≤n} Xi ≠ Xj.

Note that one such rule must be generated for each value n s.t. there is a node in the
social network with in-degree n.
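
Generating one rule per in-degree can be sketched in Python as follows. This is only an illustration: the ASCII notation used for the rule text ("(+)" for ⊕, "<-" for ←, "&" for ∧, "!=" for ≠) and the function names are assumptions of the sketch, not a syntax defined in the paper or accepted by any particular GAP system.

```python
def independence_rule(n, op="(+)"):
    """Text of the rule for a vertex V with n distinct infected predecessors."""
    indices = range(1, n + 1)
    head_annotation = f" {op} ".join(f"V{i}" for i in indices)
    edges = [f"edge(X{i}, V) : 1" for i in indices]
    infs = [f"inf(X{i}) : V{i}" for i in indices]
    # Pairwise inequalities X_i != X_j for 1 <= i < j <= n.
    neqs = [f"X{i} != X{j}" for i in indices for j in range(i + 1, n + 1)]
    return f"inf(V) : {head_annotation} <- " + " & ".join(edges + infs + neqs) + "."


def rules_for_network(in_degrees):
    """One rule for each distinct positive in-degree occurring in the network."""
    return {n: independence_rule(n) for n in sorted(set(in_degrees)) if n > 0}


rules = rules_for_network([0, 1, 3, 3, 2])
for rule in rules.values():
    print(rule)
```

The number of inequality atoms grows quadratically with n, which is one concrete way in which the size of the GAP increases.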

Thus, though GAPs provide a general method to express a vast variety of both probabilistic
and non-probabilistic diffusion models, some of these methods can lead to an
increase in the size of the GAP. When certain probabilistic assumptions are warranted
(such as independence), it may be appropriate to use the techniques for solving social
network optimization problems presented in this paper in conjunction with frameworks
such as ProbLog or Independent Choice Logic that may be able to leverage their
assumptions to produce good solutions under those assumptions. However, much more
experimentation is required to understand the pros and cons of such choices and we
leave this to future work.

There are a few papers on solving optimization problems in logic programming. The
best of these is constraint logic programming [Van Hentenryck 2009], which can embed
numerical computations within a logic program. However, CLP does not try to find solutions
to optimization problems involving semantic structures of the program itself.
Important examples of constraint logic programming include [Frühwirth 1994; Mancarella
et al. 1999], where annotated LP is used for temporal reasoning. [Leone et al.
2004] assumes the existence of a cost function on models. They present an analysis of
the complexity and algorithms to compute an optimal (w.r.t. the cost function) model
of a disjunctive logic program in three cases: when all models of the disjunctive logic program
are considered, when only minimal models of the disjunctive logic program are
considered, and when stable models of the disjunctive logic program are considered.
Our work differs in two ways. First, we are considering GAPs.
Second, we are not looking for models of a GAP that optimize an objective function;
rather, we are trying to find models of a GAP together with some additional information
(namely some vertices in the social network for which a goal atom g(v) : 1 is added
to the GAP) which is constrained (at most k additional atoms) so that the resulting
least fixpoint has an optimal value w.r.t. an arbitrary value function. In this regard, our
work has some connections with abduction in logic programs [Eiter and Gottlob 1995], but
we are not aware of any work on abduction in annotated logic programs or of any
work that optimizes an arbitrary objective function.

Our paper builds on many techniques in logic programming. It builds upon non-ground
fixpoint computation algorithms proposed by [M. Falaschi and Palamidessi
1988] and later extended for stable models semantics [Gottlob et al. 1996; Eiter et al.
1997], extends these non-ground fixpoint algorithms to GAPs, and then applies the
result to define the SNDOP-Mon algorithm to find answers to SNDOP queries which,
to the best of our knowledge, have not been considered before.

7.2.  Work  in  Social  Networks 

[Kempe  et  al.  2003]  is  one  of  the  classic  works  in  this  area  where  a  generalized  diffu¬ 
sion  framework  for  social  networks  is  proposed.  This  work  presents  two  basic  diffusion 
models:  the  linear  threshold  and  independent  cascade  models.  Both  models  utilize  ran¬ 
dom  variables  to  specify  how  the  diffusion  propagates.  These  models  roughly  resemble 
non-deterministic  versions  of  the  tipping  and  cascading  models  presented  in  Section  4 
of  this  paper.  Neither  model  allows  for  a  straightforward  representation  of  multiple 
vertex  or  edge  labels  as  this  work  does.  Additionally,  unlike  this  paper,  where  we  use  a 
fixed-point  operator  to  calculate  how  the  diffusion  process  unfolds,  the  diffusion  mod¬ 
els  of  [Kempe  et  al.  2003]  utilize  random  variables  to  define  the  diffusion  process  and 
compute  the  expected  number  of  vertices  that  have  a  given  property.  The  authors  of 
[Kempe  et  al.  2003]  only  approximate  this  expected  value  and  leave  the  exact  computa¬ 
tion  of  it  as  an  open  question.  Further,  they  provide  no  evidence  that  their  approxima¬ 
tion  has  theoretical  guarantees.  Moreover,  Lemma  7.1  and  the  discussion  immediately 
preceding it show how the linear threshold model mentioned in [Kempe et al.
2003] can be expressed via GAPs with no loss of generality (but with an increase in
the  number  of  rules  in  the  GAP  that  can  affect  performance). 

The  more  recent  work  of  [Chen  et  al.  2010]  showed  this  computation  to  be  #P-hard 
by  a  reduction  from  S-T  connectivity,  which  has  no  known  approximation  algorithm. 
This  suggests  that  a  reasonable  approximation  of  the  diffusion  process  of  [Kempe  et  al. 
2003]  may  not  be  possible.  This  contrasts  sharply  with  the  fixed-point  operator  of  [Kifer 
and Subrahmanian 1992], which can be solved in PTIME under reasonable assumptions
(which are present in this paper). [Kempe et al. 2003] focus on the problem of
finding the “most influential nodes” in the graph, which is similar in intuition to a
SNDOP  query.  However,  this  problem  only  looks  to  maximize  the  expected  number  of 
vertices  with  a  given  property,  not  a  complex  aggregate  as  a  SNDOP  query  does.  Fur¬ 
ther,  the  approximation  guarantee  presented  for  the  “most  influential  node”  problem 
is  contingent  on  an  approximation  of  the  expected  number  of  vertices  with  a  certain 
property,  which  is  not  shown  (and,  as  stated  earlier,  was  shown  by  [Chen  et  al.  2010] 
to  be  a  #P-hard  problem). 

In short, the frameworks of [Chen et al. 2010] and [Kempe et al. 2003] cannot handle
arbitrary aggregates, vertex conditions, edge and vertex predicates, or edge
weights as we do. Nor can they define an objective function using a mix of the aggregate
and the $g_O(\cdot)$ predicate specified in the definition of a SNDOP query.

Another  well-studied  related  problem  in  computer  science  is  the  “target  set  selec¬ 
tion”  problem  [Dreyer  and  Roberts  2009;  Chen  2009;  Chiang  et  al.  2011].  This  problem 
assumes  a  deterministic  tipping  model  and  seeks  to  find  a  set  of  vertices  of  a  certain 
size  that  optimizes  the  final  number  of  adopters.  Although  approximation  algorithms 
for  this  problem  have  been  discovered,  there  is  no  evidence  that  they  scale  well  for 
large  datasets.  Further,  an  easy  modification  of  Diffusion  Model  4.1  allows  for  this 
problem  to  be  represented  in  our  framework.  While  target  set  selection  can  be  encoded 
as  an  SNDOP  query,  a  straightforward  encoding  of  an  SNDOP  query  into  target  set 
selection is unlikely. This is because the target set selection problem neither considers
multiple vertex and edge labels nor seeks to optimize a complex aggregate.

8.  CONCLUSION 

Social  networks  are  proliferating  rapidly  and  have  led  to  a  wave  of  research  on  diffu¬ 
sion  of  phenomena  in  social  networks.  In  this  paper,  we  introduce  the  class  of  Social 
Network  Diffusion  Optimization  Problems  (SNDOPs  for  short)  which  try  to  find  a  set 
of  vertices  (where  each  vertex  satisfies  some  user  specified  vertex  condition)  that  has 
cardinality  k  or  less  (for  a  user-specified  k  >  0)  and  that  optimizes  an  objective  func¬ 
tion  specified  by  the  user  in  accordance  with  a  diffusion  model  represented  via  the 
well-known  Generalized  Annotated  Program  (GAP)  framework.  We  have  used  specific 
examples  of  SNDOP  queries  drawn  from  product  adoption  (cell  phone  example)  and 
epidemiology. 

The  major  contributions  of  this  paper  include  the  following: 

—  We  showed  that  answering  SNDOP  queries  is  NP-hard  and  identified  the  complexity 
classes  associated  with  related  problems  (under  various  restrictions).  We  showed 
that  the  complexity  of  counting  the  number  of  solutions  to  SNDOP  queries  is  #P- 
complete. 

—  We proved important results showing that there is no polynomial-time algorithm
that computes an $\alpha$-approximation to a SNDOP query when $\alpha > \frac{e-1}{e}$ (unless P = NP).

—  We  described  how  various  well-known  classes  of  diffusion  models  (cascading,  tipping, 
homophilic)  from  economics,  product  adoption  and  marketing,  and  epidemiology  can 
be  embedded  into  GAPs. 

—  We presented an exact algorithm for solving SNDOP queries under the assumption
of a monotonic aggregate function.

—  We  proved  that  SNDOP  queries  are  guaranteed  to  be  submodular  when  the  GAP 
representing  the  diffusion  model  is  linear  and  the  aggregate  is  positive-linear.  We 
were  able  to  leverage  this  result  to  develop  the  GREEDY-SNDOP  algorithm  that 
runs in polynomial time and that achieves the best possible approximation ratio of
$\frac{e}{e-1}$ for solving SNDOPs.

—  We developed the first implementation for solving SNDOP queries and showed it could
scale to a social network with over 7000 vertices and over 103,000 edges. Our experiments
also show that SNDOP queries over tipping models can generally be solved
more quickly than SNDOP queries over cascading models.

Much  work  remains  to  be  done  and  this  paper  merely  represents  a  first  step  towards 
the  solution  of  SNDOP  queries.  Clearly,  we  would  like  to  scale  SNDOP  queries  to  social 
networks  consisting  of  millions  of  vertices  and  billions  of  edges.  This  will  require  some 
major  advances  and  represents  a  big  challenge. 

ELECTRONIC  APPENDIX 

The electronic appendix for this article can be accessed in the ACM Digital Library.

A. PROOFS FOR SECTION 3
A.1.  Proof  of  Proposition  3.9 

If  agg  is  a  positive-linear  aggregate,  then  it  is  a  monotonic  aggregate. 

PROOF. Follows directly from Definitions 3.7-3.8. □

A.2.  Proof  of  Proposition  3.11 

Let $Q = (agg, VC, k, g_I(V), g_O(V))$ be a SNDOP query which is not zero-starting w.r.t.
a social network $S$ and a GAP $\Pi \supseteq \Pi_S$, and where $agg$ is positive-linear. Let
$agg'(X) = agg(X) - value(\emptyset)$. Then, $Q' = (agg', VC, k, g_I(V), g_O(V))$ is a SNDOP query which is
zero-starting w.r.t. $S$ and $\Pi$, $ans(Q) = ans(Q')$, and $agg'$ is positive-linear.

PROOF. The fact that $Q'$ is zero-starting and $agg'$ is positive-linear follows directly
from the construction. It is easy to see that $pre\_ans(Q) = pre\_ans(Q')$, as the set of
pre-answers depends on $\Pi$, $S$, $VC$, $k$, $g_I(V)$, and $g_O(V)$, which do not change for the two
queries. We will use $value'$ to refer to the value function for $Q'$.

$ans(Q) \subseteq ans(Q')$. Reasoning by contradiction, assume that there is an answer
$V_{ans} \in ans(Q)$ s.t. $V_{ans} \notin ans(Q')$. $V_{ans} \in pre\_ans(Q)$, which entails $V_{ans} \in pre\_ans(Q')$. Since
$V_{ans}$ is a pre-answer to $Q'$ but not an answer, there exists $V'_{ans} \in pre\_ans(Q')$ s.t.
$value'(V'_{ans}) > value'(V_{ans})$. Then $V'_{ans} \in pre\_ans(Q)$ and

$value(V'_{ans}) = value'(V'_{ans}) + value(\emptyset) > value(V_{ans}) = value'(V_{ans}) + value(\emptyset)$,

that is, $V_{ans}$ is not an answer to $Q$, which is a contradiction.

$ans(Q) \supseteq ans(Q')$. A reasoning analogous to the one above can be applied. □

A.3. Proof of Lemma 3.13

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, and a GAP
$\Pi \supseteq \Pi_S$, if $agg$ is monotonic (Definition 3.7), then $value$ (defined as per $Q$ and $\Pi$) is
monotonic.

PROOF. By the definition of $\mathbf{T}$, the annotation of any vertex atom monotonically
increases as we add more facts of the form $g_I(V) : 1 \leftarrow$ to the logic program. Hence, by
the monotonicity of $agg$, the statement follows. □

A.4.  Proof  of  Lemma  3.14 

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, and a GAP
$\Pi \supseteq \Pi_S$, if $Q$ is a-priori VC w.r.t. $S$ and $\Pi$, then the set of pre-answers is a uniform
matroid.

PROOF. Let $V_{cond}$ be the set of vertices in $V$ s.t. for each $v \in V_{cond}$,

$\{g_I(v) : 1\} \cup \bigcup_{pred \in \ell_{vert}(v)} \{pred(v) : 1\} \models VC[V/v]$.

CLAIM 1: For an a-priori VC SNDOP query, any subset of $V_{cond}$ of cardinality $\leq k$ is a
pre-answer.

Suppose, BWOC, some subset $V' \subseteq V_{cond}$ of cardinality $\leq k$ is not a pre-answer.
Obviously, all such subsets meet the cardinality requirement. Then, there must
exist some $v' \in V'$ s.t.
$\{g_I(v') : 1\} \cup \bigcup_{pred \in \ell_{vert}(v')} \{pred(v') : 1\} \not\models VC[V/v']$. By
Definition 3.12, this is a contradiction.

CLAIM 2: There is no subset $V' \subseteq V$ where $V' \cap (V - V_{cond}) \neq \emptyset$ that is a pre-answer.
Clearly, such a subset would have an element that does not satisfy the a-priori VC and hence
would not be a pre-answer.

Proof of lemma: The family of all subsets of $V_{cond}$ of size $\leq k$ is a uniform matroid by definition. Also,
from claims 1-2, we know that this family of sets corresponds exactly with the set
of pre-answers. Hence, the statement of the lemma follows. □
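The uniform-matroid property used in the lemma can be checked by brute force on a small instance. The following is a minimal sketch, assuming a hypothetical four-vertex set `v_cond` and `k = 2`; the helper names `uniform_matroid` and `is_matroid` are ours and do not appear in the paper.

```python
from itertools import combinations


def uniform_matroid(ground, k):
    """All subsets of `ground` with at most k elements, as frozensets."""
    items = sorted(ground)
    return {frozenset(c) for r in range(k + 1) for c in combinations(items, r)}


def is_matroid(family):
    """Check the three independence axioms by exhaustive enumeration."""
    if frozenset() not in family:
        return False
    # Hereditary: removing any single element keeps a set independent
    # (closure under arbitrary subsets then follows by induction).
    for s in family:
        for x in s:
            if s - {x} not in family:
                return False
    # Exchange: a smaller independent set can be extended from a larger one.
    for a in family:
        for b in family:
            if len(a) < len(b) and not any(a | {x} in family for x in b - a):
                return False
    return True


# Hypothetical vertices that satisfy the a-priori vertex condition.
v_cond = {"v1", "v2", "v3", "v4"}
pre_answers = uniform_matroid(v_cond, k=2)
print(len(pre_answers), is_matroid(pre_answers))
```

The check is exponential in the size of the ground set and is only meant to make the definition concrete, not to be run on a real network.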

A.5. Proof of Theorem 3.15

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, and a GAP
$\Pi \supseteq \Pi_S$, if the following criteria are met:

—  $\Pi$ is a linear GAP,

—  $Q$ is a-priori VC, and

—  $agg$ is positive-linear,

then $value$ (defined as per $Q$ and $\Pi$) is sub-modular.

In other words, for $V_{cond} = \{v' \mid v' \in V \text{ and } (\Pi_S \cup \{g_I(v') : 1\} \models VC[V/v'])\}$, if
$V_1 \subseteq V_2 \subseteq V_{cond}$ and $v \in V_{cond} - V_2$, then the following holds:

$value(V_1 \cup \{v\}) - value(V_1) \geq value(V_2 \cup \{v\}) - value(V_2)$
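The diminishing-returns inequality above can be tested exhaustively on a toy instance. The sketch below is only an illustration: `coverage_value` is a hypothetical stand-in for $value$ (a one-step deterministic spread with SUM as the aggregate), not a general linear GAP, and the `reach` relation is invented for the example.

```python
from itertools import combinations


def coverage_value(chosen, reach):
    """Number of vertices that are chosen or reached from a chosen vertex."""
    covered = set(chosen)
    for v in chosen:
        covered |= reach.get(v, set())
    return len(covered)


def is_submodular(ground, value):
    """Check value(V1+v) - value(V1) >= value(V2+v) - value(V2)
    for every V1 subset of V2 and every v outside V2."""
    items = sorted(ground)
    subsets = [frozenset(c) for r in range(len(items) + 1)
               for c in combinations(items, r)]
    for v2 in subsets:
        for v1 in subsets:
            if not v1 <= v2:
                continue
            for v in ground - v2:
                gain1 = value(v1 | {v}) - value(v1)
                gain2 = value(v2 | {v}) - value(v2)
                if gain1 < gain2:
                    return False
    return True


# Hypothetical spread relation: seed vertex -> vertices it activates.
reach = {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"z"}}
v_cond = frozenset(reach)
print(is_submodular(v_cond, lambda s: coverage_value(s, reach)))
```

A function with increasing marginal gains, such as the squared cardinality of the set, fails the same check, which is a quick way to confirm that the test is not vacuous.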

PROOF. CLAIM 1: For some $V'$, if $A_i : \mu_i \in \mathbf{T}_{\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}}$ s.t. there is no
$\mu'_i > \mu_i$ where $A_i : \mu'_i \in \mathbf{T}_{\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}}$, then there exists a polynomial of the
following form:

$f_i(X_1, \ldots, X_{|V|}) = \lambda_1 \cdot X_1 + \ldots + \lambda_{|V|} \cdot X_{|V|} + \lambda_{|V|+1}$

s.t. if each $X_j$ where $v_j \in V'$ is set to 1 and each $X_j$ where $v_j \notin V'$ is set to 0, then

$f_i(X_1, \ldots, X_{|V|}) = \mu_i$.

(Proof of claim 1): Consider all of the rules in $\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}$. If
$A_i : \mu_i \in \mathbf{T}_{\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}}$, then there must exist a rule that causes the
annotation of $A_i$ to equal $\mu_i$. As the annotation in all rules is a linear function, we can
easily re-write it in the above form, based on the presence of annotated atoms in the
body formed with the goal predicate.

CLAIM 2: For some $V'$, if $A_i : \mu_i \in \mathbf{T}_{\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}} \uparrow j$, s.t. there is no $\mu'_i > \mu_i$
where $A_i : \mu'_i \in \mathbf{T}_{\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}} \uparrow j$, then there exists a polynomial of the
following form:

$f_i(X_1, \ldots, X_{|V|}) = \lambda_1 \cdot X_1 + \ldots + \lambda_{|V|} \cdot X_{|V|} + \lambda_{|V|+1}$

s.t. if each $X_j$ where $v_j \in V'$ is set to 1 and each $X_j$ where $v_j \notin V'$ is set to 0, then

$f_i(X_1, \ldots, X_{|V|}) = \mu_i$.

(Proof of claim 2): We will show that if the statement of the claim is true for the
$(j - 1)$-th application of $\mathbf{T}$, then it is true for application $j$. The proof of the claim relies
on this subclaim along with claim 1. If the claim holds for application $j - 1$, then for
each annotated atom $A_l : \mu_l$ there is an associated polynomial as per the statement.

Consider the rule that fires in the $j$th application of the operator that causes $A_i$ to
be annotated with $\mu_i$. We can re-write this as a polynomial of the above form, simply by
substituting the polynomial for each annotation associated with $A_l$ from the previous
iteration. As all of the polynomials are being substituted into variable positions of
a polynomial, the result is still a polynomial, which can easily be re-arranged to
resemble that of the claim.

CLAIM 3: For some $V'$, if $A_i : \mu_i \in lfp(\mathbf{T}_{\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}})$, s.t. there is no $\mu'_i > \mu_i$
where $A_i : \mu'_i \in lfp(\mathbf{T}_{\Pi \cup \{g_I(v') : 1 \leftarrow \mid v' \in V'\}})$, then there exists a polynomial of
the following form:

$f_i(X_1, \ldots, X_{|V|}) = \lambda_1 \cdot X_1 + \ldots + \lambda_{|V|} \cdot X_{|V|} + \lambda_{|V|+1}$

s.t. if each $X_j$ where $v_j \in V'$ is set to 1 and each $X_j$ where $v_j \notin V'$ is set to 0, then

$f_i(X_1, \ldots, X_{|V|}) = \mu_i$.

(Proof of claim 3): Follows directly from claims 1-2.

CLAIM 4: For some $V' \subseteq V$, there exists a polynomial of the following form:

$f_i(X_1, \ldots, X_{|V|}) = \lambda_1 \cdot X_1 + \ldots + \lambda_{|V|} \cdot X_{|V|} + \lambda_{|V|+1}$

s.t. if each $X_j$ where $v_j \in V'$ is set to 1 and each $X_j$ where $v_j \notin V'$ is set to 0, then
$f_i(X_1, \ldots, X_{|V|}) = value(V')$.

(Proof of claim 4): Consider all atoms formed with predicate goal in the lfp where
the annotation is maximum. By claim 3, each is associated with a polynomial. A
positive-linear combination of all these polynomials is a polynomial of the form in this
claim, and is equivalent to $value$.

CLAIM 5: $value(V_1 \cup \{v\}) - value(V_1) \geq value(V_2 \cup \{v\}) - value(V_2)$.

(Proof of claim 5): By the definition of $value$, as the query is a-priori VC, we know that
$value$ is defined on all subsets of $V_{cond}$.

We define the following polynomial functions, which are associated with $value$ for the
various subsets of $V$ in claim 5 (with some re-arrangement; Greek letters represent
constants, and the $X$ variables can be either 0 or 1, signifying whether the associated
subscript is included in the associated set).

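(1) $f_1(X_{V_1}, X_{\{v\}}, X_{V_2 - V_1}) = \alpha_1 \cdot X_{V_1} + \beta_1 \cdot X_{\{v\}} + \gamma_1 \cdot X_{V_2 - V_1} + \lambda_1$

$value(V_1 \cup \{v\}) = f_1(1, 1, 0) = \alpha_1 + \beta_1 + \lambda_1$

(2) $f_2(X_{V_1}, X_{\{v\}}, X_{V_2 - V_1}) = \alpha_2 \cdot X_{V_1} + \beta_2 \cdot X_{\{v\}} + \gamma_2 \cdot X_{V_2 - V_1} + \lambda_2$

$value(V_1) = f_2(1, 0, 0) = \alpha_2 + \lambda_2$

(3) $f_3(X_{V_1}, X_{\{v\}}, X_{V_2 - V_1}) = \alpha_3 \cdot X_{V_1} + \beta_3 \cdot X_{\{v\}} + \gamma_3 \cdot X_{V_2 - V_1} + \lambda_3$

$value(V_2 \cup \{v\}) = f_3(1, 1, 1) = \alpha_3 + \beta_3 + \gamma_3 + \lambda_3$

(4) $f_4(X_{V_1}, X_{\{v\}}, X_{V_2 - V_1}) = \alpha_4 \cdot X_{V_1} + \beta_4 \cdot X_{\{v\}} + \gamma_4 \cdot X_{V_2 - V_1} + \lambda_4$

$value(V_2) = f_4(1, 0, 1) = \alpha_4 + \gamma_4 + \lambda_4$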

CLAIM 5.1: $\alpha_4 + \gamma_4 + \lambda_4 \geq \alpha_2 + \gamma_3 + \lambda_2$

(Proof of claim 5.1): We note that the constants in the $f_i$'s defined earlier all correspond
directly with constants seen in rules. Hence, as $f_4(1, 0, 1)$ corresponds with the maximum
possible value for $value(V_2)$, there can be no constants other than $\alpha_4, \gamma_4, \lambda_4$ that
sum to a value greater than $value(V_2)$. The statement of claim 5.1 immediately follows.

CLAIM 5.2: $\alpha_1 + \beta_1 + \lambda_1 \geq \alpha_3 + \beta_3 + \lambda_3$

(Proof of claim 5.2): Mirrors claim 5.1 (in this case, $value(V_1 \cup \{v\})$ is the maximum
possible value of $f_1(1, 1, 0)$).

(Completion of claim 5 / theorem): Suppose, BWOC, claim 5 is not true. Then, it must
be the case that

$value(V_1 \cup \{v\}) - value(V_1) < value(V_2 \cup \{v\}) - value(V_2)$

This would imply:

$\alpha_1 + \beta_1 + \lambda_1 + \alpha_4 + \gamma_4 + \lambda_4 < \alpha_3 + \beta_3 + \gamma_3 + \lambda_3 + \alpha_2 + \lambda_2$

By claim 5.2, we have the following:

$\alpha_4 + \gamma_4 + \lambda_4 < \gamma_3 + \alpha_2 + \lambda_2$

which contradicts claim 5.1. The statement of the theorem follows. □

A.6. Proof of Theorem 3.17

Finding an answer to a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$ (w.r.t. a social
network $S$ and a GAP $\Pi \supseteq \Pi_S$) is NP-hard (even if $\Pi$ is a linear GAP, $VC = \emptyset$,
$agg = SUM$ and $value$ is zero-starting).

PROOF. We reduce from the known NP-hard problem MAX-K-COVER [Feige 1998], defined as follows.

MAX-K-COVER

INPUT: Set of elements $S$, a family of subsets of $S$, $\mathcal{H} = \{H_1, \ldots, H_{max}\}$, and
positive integer $K$.

OUTPUT: $\leq K$ subsets from $\mathcal{H}$ s.t. the union of the subsets covers a maximal number
of elements in $S$.

We shall make the following assumptions about MAX-K-COVER:

(1) $|\mathcal{H}| > K$

(2) There is no $H \in \mathcal{H}$ s.t. $H = \emptyset$

CONSTRUCTION: Given MAX-K-COVER input $S, \mathcal{H}, K$, we create a SNDOP query
as follows.

(1) Set up social network $S$ as follows:

(a) $EP = \{edge\}$

(b) $VP = \{vertex\}$

(c) For every element of $\mathcal{H}$, and every element of $S$, we create an element of $V$. We
shall denote by $V_S$ and $V_{\mathcal{H}}$ the subsets of $V$ whose vertices correspond with $S$ and
$\mathcal{H}$ respectively. For some $s \in S$, $v_s$ is the corresponding vertex. For some $H \in \mathcal{H}$,
$v_H$ is the corresponding vertex. Note that $V = V_S \cup V_{\mathcal{H}}$.

(d) For each $H \in \mathcal{H}$, if $s \in H$, add edge $(v_H, v_s)$ to set $E$

(e) For each $v \in V$, $\ell_{vert}(v) = vertex$

(f) For each $(v, v') \in E$, $\ell_{edge}(v, v') = edge$

(g) For each $(v, v') \in E$, $w(v, v') = 1$

(2) Set up program $\Pi$ as follows:

(a) Embed $S$ into $\Pi$.

(b) Add diffusion rule $vertex(V) : X \leftarrow vertex(V') : X \wedge edge(V', V) : 1$ to $\Pi$

(3) Set up SNDOP query $Q$ as follows:

(a) $agg = SUM$

(b) $VC = \emptyset$

(c) $k = K$ (the $K$ from MAX-K-COVER)

(d) $g = vertex$

Additionally,  we  will  use  the  following  notation: 

(1) $V'$ is a pre-answer to the constructed query

(2) $value(V')$ is the value of the constructed query for pre-answer $V'$

(3) $V'_{ans}$ is an answer to the constructed query

CLAIM  1:  The  construction  can  be  performed  in  PTIME. 

Straightforward. 
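The construction can be sketched in a few lines of code. The following is a minimal illustration under our own assumptions: the tuple encoding of vertices, the function names, and the five-element instance are hypothetical, and the exhaustive search in `best_pre_answer` is exponential and serves only to make the correspondence with MAX-K-COVER visible on a toy input.

```python
from itertools import combinations


def build_network(elements, family):
    """One vertex per set and per element; edge (v_H, v_s) whenever s is in H."""
    edges = {("H", i): {("s", s) for s in h} for i, h in enumerate(family)}
    vertices = set(edges) | {("s", s) for s in elements}
    return vertices, edges


def value(seed, edges):
    """SUM of vertex annotations in the least fixpoint of the diffusion rule.

    Seeds are annotated 1 and the rule copies annotation 1 along each edge,
    so the sum equals the number of vertices reachable from the seed set.
    """
    annotated = set(seed)
    frontier = list(seed)
    while frontier:
        v = frontier.pop()
        for w in edges.get(v, ()):
            if w not in annotated:
                annotated.add(w)
                frontier.append(w)
    return len(annotated)


def best_pre_answer(vertices, edges, k):
    """Exhaustive search over all vertex sets of size at most k."""
    candidates = (c for r in range(k + 1)
                  for c in combinations(sorted(vertices), r))
    return max(candidates, key=lambda c: value(c, edges))


# Hypothetical MAX-K-COVER instance with K = 2.
elements = {1, 2, 3, 4, 5}
family = [{1, 2}, {2, 3}, {4, 5}, {5}]
vertices, edges = build_network(elements, family)
answer = best_pre_answer(vertices, edges, k=2)
print(answer, value(answer, edges))
```

On this instance the best value is the number of covered elements plus the number of chosen set vertices, which is the relationship the claims below rely on.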

CLAIM 2: Program $\Pi$ is a linear GAP.
Follows directly from Definition 3.6.

CLAIM 3: An answer $V'_{ans}$ to the SNDOP query cannot contain a vertex $v_s \in V_S$
and a vertex $v_H \in V_{\mathcal{H}}$ s.t. $s \in H$.

BWOC, an optimal solution could have an element $v_s$ as described in the claim. By
assumption 1, there are more than $K$ elements in $V_{\mathcal{H}}$ and all of them have an edge to
some element of $V_S$ by assumption 2. It is obvious that $v_s$ will be annotated with a 1
in the fixed point, and that no elements of $V_{\mathcal{H}} - V'_{ans}$ will be annotated with 1 in the
fixed point. Hence, we can replace $v_s$ with any element of $V_{\mathcal{H}} - V'_{ans}$ and $value$ will be at least one
greater than that of the “optimal” solution, hence a contradiction.

CLAIM 4: If an answer $V'_{ans}$ has $V'_{ans} \cap V_S \neq \emptyset$, then we can construct an alternative optimal
solution such that $V'_{ans} \cap V_S = \emptyset$.

As no element in $V'_{ans} \cap V_S$ has an outgoing neighbor, and by assumption 1 we can
be assured that $|V_{\mathcal{H}} - V'_{ans}| \geq |V'_{ans} \cap V_S|$, we can replace the elements of $V'_{ans} \cap V_S$ in
$V'_{ans}$ with elements from $V_{\mathcal{H}} - V'_{ans}$ and still be ensured of an optimal solution.

CLAIM 5: Given a set $\mathcal{H}' \subseteq \mathcal{H}$ that ensures an optimal solution to MAX-K-COVER,
we can construct an optimal $V'_{ans}$ to the SNDOP query.

CASE 1 (claim 5): $|\mathcal{H}'| = K$.

Let OPT be the number of elements of $S$ covered in the optimal solution of
MAX-K-COVER. For each $H \in \mathcal{H}'$, we pick the corresponding element of $V_{\mathcal{H}}$. Obviously,
$value(V'_{ans}) = OPT + K$. Suppose we could pick a different element of $V$ and get a
solution with a higher value. As no element of $V_S$ has an outgoing edge, replacing one
of the elements from the constructed set with one of these will not ensure a greater
solution. If we could pick an element from $V_{\mathcal{H}} - V'_{ans}$, then this would obviously imply
a solution to MAX-K-COVER s.t. more than OPT elements of $S$ are covered; clearly
this is a contradiction as $\mathcal{H}'$ is an optimal cover.

CASE 2 (claim 5): $|\mathcal{H}'| < K$.

Create $\mathcal{H}''$ with all of the elements of $\mathcal{H}'$ and $K - |\mathcal{H}'|$ elements of $\mathcal{H} - \mathcal{H}'$. Clearly, this
is also an optimal solution to MAX-K-COVER (as cardinality is not optimized; it just
needs to be at most $K$). We can now apply case 1 of this claim.

CLAIM 6: Given $V'_{ans}$, we can constructively create a subset of $\mathcal{H}$ that, if picked,
ensures an optimal solution to MAX-K-COVER.

CASE 1 (claim 6): $V'_{ans} \subseteq V_{\mathcal{H}}$

Simply pick each $H$ associated with each $v_H \in V'_{ans}$. Let $OPT' = value(V'_{ans})$; note that
$OPT' = K + SPREAD$, where SPREAD corresponds with the number of 1-annotated
elements of $V_S$ in the fixed point. If there is a different subset of $\mathcal{H}$ that can be picked
(i.e. a more optimal solution to MAX-K-COVER), then we can create a solution to the
SNDOP query where some $SPREAD' > SPREAD$ elements of $V_S$ become annotated
with 1 in the fixed point. Clearly, this would imply a more optimal solution to the
SNDOP query, which is a contradiction.

CASE 2 (claim 6): $V_S \cap V'_{ans} \neq \emptyset$

From this solution, we can use claim 4 to create an optimal solution s.t. case 1 applies.

The proof of the theorem follows directly from claims 5-6. □

A.7.  Proof  of  Theorem  3.18 

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, a GAP
$\Pi \supseteq \Pi_S$, and a real number $target$, the problem of checking whether there exists a
pre-answer $V'$ s.t. $value(V') \geq target$ is in NP under the assumptions that $agg$ and the
functions in T are polynomially computable, and $\Pi$ is ground.

Proof. 

CLAIM  1:  SNDOP-DEC  is  NP-hard. 

We  do  this  by  a  reduction  from  SET_COVER. 

CONSTRUCTION: Given instance $S, \mathcal{H}, K$ of SET_COVER, we create $K$ instances of
SNDOP-DEC, each identified with index $i \in [1, K]$, that each use the same construction
used to show the NP-hardness of a SNDOP query with the following two exceptions:

—  Set  k  in  SNDOP-DEC  to  i 

—  Set  target  in  SNDOP-DEC  to  i  +  |S| 

CLAIM  1.1:  The  construction  can  be  performed  in  PTIME. 

Straightforward.

CLAIM 1.2: If there is a solution to the set cover problem, at least
one of the constructed instances of SNDOP-DEC will return “yes.”

Suppose that there is a solution to the set cover problem that selects
$m$ elements of $\mathcal{H}$ (where $m \leq K$). By the construction, there exists an instance of
SNDOP-DEC such that $target = m + |S|$ and $k = m$. We simply pick the $k$ vertices
in $V_{\mathcal{H}}$ corresponding with the covers, and by the construction, after running $\Pi$, all of
the vertices in $V_S$ will have their vertex atoms annotated with 1.
Hence, the aggregate will be $m + |S|$, which meets $target$, so that instance of
SNDOP-DEC returns “yes.”

CLAIM  1.3:  If  there  is  no  solution  to  the  set  cover  problem,  all  of  the  instances  of 
SNDOP-DEC  will  return  “no.” 

Suppose there is no solution to SET_COVER and one of the constructed instances of
SNDOP-DEC returns “yes.” Then, for some $i \in [1, K]$, there are $i$ vertices that can
be picked to change the annotation of the vertex atoms to ensure that the
aggregate is greater than or equal to $i + |S|$. As, at most, only $i$ vertex atoms can be
picked, and only atoms in $V_S$ can change annotation due to $\Pi$, all $i$ vertices associated
with the vertex atoms must be in $V_{\mathcal{H}}$ to ensure that we have the most possible
atoms formed with vertex that have a non-zero annotation. However, in order for all
of the vertices in $V_S$ to have the annotations of the associated vertex atom
increase to 1, there must be at least one incoming edge to each element of $V_S$ from one
of the $i$ vertices from $V_{\mathcal{H}}$. By how $S$ is constructed, this would imply a set cover of size $i$,
which would be a contradiction.

PROOF OF CLAIM 1: Follows directly from claims 1.1-1.3.

CLAIM 2: SNDOP-DEC is in NP (with the conditions in the statement).

Suppose we are given a set $V'$. We can easily verify this solution in PTIME as follows:
(i) verifying that $V'$ is a valid pre-answer can easily be done in PTIME by checking that $|V'| \leq k$
and that for all $v' \in V'$, $VC(v')$ is true; (ii) by the assumptions about $agg$ and the functions
in T, we can compute $value(V')$ in PTIME as well. The statement follows. □
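The two verification steps can be sketched as a certificate checker. This is a minimal illustration under our own assumptions: the ground program is abstracted as a monotone `step` function on annotation dictionaries, all names are hypothetical, and the loop is polynomial only when the fixpoint is reached in polynomially many iterations, as it is on the toy network used here.

```python
def least_fixpoint(step, start):
    """Iterate a monotone operator on annotation dicts until it stabilizes."""
    current = dict(start)
    while True:
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt


def verify(candidate, vertices, k, vc, step, agg, target):
    """Check that `candidate` witnesses a 'yes' answer to SNDOP-DEC."""
    # Step (i): the candidate must be a valid pre-answer.
    if len(candidate) > k or not all(vc(v) for v in candidate):
        return False
    # Step (ii): compute value(candidate) and compare it with the target.
    start = {v: (1.0 if v in candidate else 0.0) for v in vertices}
    fixpoint = least_fixpoint(step, start)
    return agg(fixpoint.values()) >= target


# Hypothetical toy network in the style of the reduction above.
edges = {"h1": ["s1", "s2"], "h2": ["s2", "s3"]}
vertices = ["h1", "h2", "s1", "s2", "s3"]


def step(ann):
    """One application of the rule: copy annotations along every edge."""
    out = dict(ann)
    for src, dsts in edges.items():
        for dst in dsts:
            out[dst] = max(out[dst], ann[src])
    return out


print(verify({"h1", "h2"}, vertices, 2, lambda v: True, step, sum, 5))
```

Passing `agg` and `vc` as callables mirrors the statement of the theorem, which treats the aggregate and the vertex condition as parameters of the query.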

A.8. Proof of Theorem 3.20

Answering a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$ (w.r.t. a social network $S$
and a GAP $\Pi \supseteq \Pi_S$) cannot be approximated in PTIME within a ratio of $\frac{e-1}{e} + \epsilon$ for
some $\epsilon > 0$ (where $e$ is the base of the natural logarithm) unless P = NP, even if $\Pi$ is a
linear GAP, $VC = \emptyset$, $agg = SUM$ and $value$ is zero-starting.

PROOF. Suppose, BWOC, there is an $\alpha$-approximation algorithm for an SNDOP
query. Hence, we can approximate the $value$ returned by SNDOP within a factor of
$1 - 1/e + \epsilon$ for some $\epsilon > 0$. Using the MAX-K-COVER reduction in Theorem 3.17, for
SNDOP answer $V'_{ans}$, the cardinality of the covered elements of $S$ in MAX-K-COVER
is $value(V'_{ans}) - K$. Hence, this approximation algorithm would provide a solution to
MAX-K-COVER within a factor of $1 - 1/e + \epsilon$ for some $\epsilon > 0$. By Theorem 5.3 of [Feige
1998], this would imply P = NP, which contradicts the statement of the theorem. We
recall that Theorem 5.3 of [Feige 1998] states that for any $\epsilon > 0$, MAX-K-COVER cannot
be approximated in polynomial time within a ratio of $(1 - 1/e + \epsilon)$, unless P = NP.
The proof in [Feige 1998] uses a reduction from approximating 3SAT-5, which is the
problem of determining the maximum number of clauses that can be simultaneously
satisfied in a CNF formula with $n$ variables and $5n/3$ clauses, in which every clause
contains exactly three literals, every variable appears in exactly five clauses, and a
variable does not appear in a clause more than once. □

A.9.  Proof  of  Theorem  3.21 

Counting the number of answers to a SNDOP query $Q$ (w.r.t. a social network $S$ and a
GAP $\Pi \supseteq \Pi_S$) is #P-complete.

Follows directly from Lemmas A.1 and A.2.

LEMMA A.1. The counting version of the SNDOP query answering problem (we
shall call it #SNDOP) is #P-hard.

PROOF. We now define the known #P-complete problem, MONSAT [Roth 1996],
and a variant of it used in this proof:

Counting K-Monotone CNF Sat. (#MONSAT)

INPUT:  Set  of  clauses  C,  each  with  K  disjuncted  literals,  no  literals  are  negations,  L 
is  the  set  of  atoms. 

OUTPUT:  Number  of  subsets  of  L  such  that  if  the  atoms  in  the  subset  are  true,  all  of 
the  clauses  in  C  are  satisfied. 

Counting  K-Monotone  CNF  Sat.  -  Exact  (#MONSAT-EQ) 

INPUT: Set of clauses $C$, each with $K$ disjuncted literals, no literals are negations, $L$
is the set of atoms and natural number $m$.

OUTPUT:  Number  of  subsets  of  L  -  each  with  cardinality  of  exactly  m  -  such  that  if 
the  atoms  in  the  subset  are  true,  all  of  the  clauses  in  C  are  satisfied. 

We  now  define  the  following  problem  used  in  the  proof: 

#SNDOP-EQ 

INPUT:  Same  as  SNDOP-DEC. 

OUTPUT: Number of pre-answers $V'$ that would cause a “yes” answer to SNDOP-DEC
and $|V'| = k$.

CLAIM 1: #MONSAT $\leq_p$ #MONSAT-EQ and #MONSAT-EQ is #P-hard.

Consider the following construction (CONSTRUCTION 1):

Let $L$ be the set of atoms associated with #MONSAT. Create $|L|$ instances of
#MONSAT-EQ, each with a cardinality constraint ($m$) in $[1, |L|]$, and the remainder of
the input the same as #MONSAT.

(Proof of claim 1): The sum of the solutions to the $|L|$ instances of #MONSAT-EQ is
equal to the solution to #MONSAT.

Every possible satisfying assignment counted as a solution to #MONSAT has a unique
cardinality associated with it, which is in $[1, |L|]$. The claim follows trivially from this
fact and construction 1 (which can be performed in PTIME).
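The counting identity behind claim 1 can be checked by enumeration on a small formula. The sketch below is illustrative only: the two-clause monotone formula and the function names are our own assumptions, and both counters are exponential in the number of atoms.

```python
from itertools import combinations


def satisfies(true_atoms, clauses):
    """A monotone CNF is satisfied when every clause contains a true atom."""
    return all(clause & true_atoms for clause in clauses)


def count_monsat(atoms, clauses):
    """#MONSAT: number of atom subsets that satisfy every clause."""
    items = sorted(atoms)
    return sum(satisfies(set(c), clauses)
               for r in range(len(items) + 1)
               for c in combinations(items, r))


def count_monsat_eq(atoms, clauses, m):
    """#MONSAT-EQ: as above, restricted to subsets of exactly m atoms."""
    return sum(satisfies(set(c), clauses)
               for c in combinations(sorted(atoms), m))


# Hypothetical formula (a or b) and (b or c).
atoms = {"a", "b", "c"}
clauses = [{"a", "b"}, {"b", "c"}]
total = count_monsat(atoms, clauses)
by_size = [count_monsat_eq(atoms, clauses, m) for m in range(1, len(atoms) + 1)]
print(total, by_size, sum(by_size) == total)
```

The empty assignment never satisfies a non-empty clause set, which is why summing over cardinalities from 1 to $|L|$ loses nothing.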

CLAIM 2: #MONSAT-EQ $\leq_p$ #SNDOP-EQ and #SNDOP-EQ is #P-hard.

Consider the following construction (CONSTRUCTION 2):

Given #MONSAT-EQ input $(C, K, L, m)$, we create an instance of #SNDOP-EQ as follows.

(1)  Set  up  social  network  S  as  follows: 

(a)  EP  =  {edge} 

(b)  VP  =  {vertex} 

(c) For every element of $C$, and every element of $L$, we create an element of $V$. We
shall denote by $V_C$ and $V_L$ the subsets of $V$ whose vertices correspond with $C$ and
$L$ respectively. For some $a \in C$, $v_a$ is the corresponding vertex. For some $b \in L$,
$v_b$ is the corresponding vertex.

(d) For each $a \in C$, if $b$ is in clause $a$, add edge $(v_b, v_a)$ to set $E$

(e) For each $v \in V$, $\ell_{vert}(v) = vertex$

(f) For each $(v, v') \in E$, $\ell_{edge}(v, v') = edge$

(g) For each $(v, v') \in E$, $w(v, v') = 1$

(2) Set up program $\Pi$ as follows:

(a) Embed $S$ into $\Pi$

(b) For each $v \in V$, add fact $vertex(v) : 0$ to $\Pi$

(c) Add diffusion rule $vertex(v) : 1 \leftarrow vertex(v') : 1 \wedge edge(v', v) : 1$ to $\Pi$

(3) Set up SNDOP query $Q$ as follows:

(a) $agg = SUM$

(b) $VC = \emptyset$

(c) $k = m$ (the $m$ from #MONSAT-EQ)

(d) $g = vertex$

(e) $target = |C| + k$

CLAIM  2.1:  Construction  2  can  be  performed  in  PTIME. 

Straightforward. 

CLAIM 2.2: If there is a solution to the given instance of MONSAT-EQ, then given
construction 2 as input, SNDOP-EQ will return “yes”.

For each $a \in L$ in the solution to
MONSAT-EQ, change the annotation of $vertex(v_a)$ to 1 in the facts of $\Pi$. There are $m = k$ such
vertices. By the construction, this will cause the $|C|$ vertices of $V_C$ to increase their
annotation, resulting in an aggregate of $|C| + k$ and causing SNDOP-EQ to return “yes”.

CLAIM 2.3: If, given construction 2 as input, SNDOP-EQ returns “yes”, then there is a solution
to the given instance of MONSAT-EQ such that $k$ is the cardinality of the solution.

We note that selecting any vertex in $V'$ not in $V_L$ will result in a $value(V') < |C| + k$,
as fewer than $|C|$ nodes will have their annotation increase after running $\Pi$. The only
way to achieve a $value(V') = |C| + k$ is if there exists a set of $k$ vertices in $V_L$ such
that there is an outgoing edge from at least one of the picked vertices to each node in
$V_C$. This is only possible if there exists a solution to the MONSAT-EQ problem.

CLAIM 2.4: There is a 1-1 correspondence between solutions to MONSAT-EQ and SNDOP-EQ using construction 2.

As each literal in a MONSAT-EQ solution corresponds to exactly one vertex in a SNDOP-EQ solution, and by claims 2.2-2.3, the claim follows.

PROOF OF CLAIM 2: Follows directly from claims 2.1-2.4.

CLAIM 3: #SNDOP-EQ $\le_p$ #SNDOP, so #SNDOP is #P-hard.
Consider the following construction (CONSTRUCTION 3):

Let $k$ be the cardinality constraint associated with #SNDOP-EQ. Create two instances of #SNDOP, one with a cardinality constraint of $k$ and one with a constraint of $k - 1$; the remainder of the input is the same as for #SNDOP-EQ.

PROOF OF CLAIM 3: First, note that construction 3 can be performed in PTIME. We show that the solution to #SNDOP with cardinality constraint $k - 1$ subtracted from the solution to #SNDOP with cardinality constraint $k$ is the solution to #SNDOP-EQ. The solution to #SNDOP with cardinality constraint $k - 1$ is the number of all sets $V'$ that are a solution with cardinality $k - 1$ or less, and the solution to #SNDOP with cardinality constraint $k$ is the number of all sets $V'$ that are a solution with cardinality $k$ or less. Hence the difference is the number of all sets $V'$ that are a solution with a cardinality of exactly $k$.
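
The subtraction in Construction 3 can be checked on a small example. The sketch below is only an illustration: `count_at_most` stands in for a #SNDOP solver, the predicate `reaches_four` is a made-up notion of "solution", and none of these names come from the paper.

```python
from itertools import combinations

def count_at_most(vertices, is_solution, k):
    """Stand-in for #SNDOP: number of solution sets with cardinality k or less."""
    return sum(
        1
        for size in range(k + 1)
        for subset in combinations(vertices, size)
        if is_solution(set(subset))
    )

def count_exactly(vertices, is_solution, k):
    """Stand-in for #SNDOP-EQ, obtained from two calls to the #SNDOP stand-in."""
    return count_at_most(vertices, is_solution, k) - count_at_most(vertices, is_solution, k - 1)

# Hypothetical toy predicate: a set counts as a solution when its elements sum to at least 4.
toy_vertices = [1, 2, 3, 4]
reaches_four = lambda subset: sum(subset) >= 4
print(count_exactly(toy_vertices, reaches_four, 2))
```

The two calls differ only in the cardinality bound, which mirrors the two #SNDOP instances created by the construction.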

PROOF OF LEMMA: Follows directly from claims 1-3. □

LEMMA A.2. If the aggregate function agg is polynomially computable and the functions in $\mathcal{F}$ are polynomially computable, then #SNDOP is in #P.

PROOF. We use the two requirements for membership in #P as presented in [Kozen 1991].

(i) Witnesses must be verifiable in PTIME (shown in the proof of NP-completeness of a SNDOP-query).

(ii) The number of solutions to #SNDOP is bounded by $x^{k'}$, where $k'$ is a constant. We know that the number of solutions is bounded by $\binom{|V|}{k}$, which is less than $c \cdot |V|^k$ for some constant $c$. □

A.10.  Proof  of  Theorem  3.22 

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, and a GAP $\Pi \supseteq \Pi_S$, there exists a polynomial-time algorithm with an oracle to SNDOP-ALL which answers $Q$.

Proof. We shall refer to the problem of finding $\bigcup_{V' \in ans(Q)} V'$ as SNDOP-ALL. We show that solving a SNDOP-query is $\le_p$ SNDOP-ALL.

Given an instance of SNDOP-ALL and a vertex set $V^*$ with $|V^*| \le k$, let SNDOP-ALL($V^*$) be the modification of the instance of SNDOP-ALL where the value $k$ is reduced by $|V^*|$ and, for each $v^* \in V^*$, the fact $g_I(v^*) : 1$ is added to $\Pi$.

Consider the following informal algorithm (FIND-SET) that takes an instance of SNDOP-ALL ($Q$) and some vertex set $V^*$ with $|V^*| \le k$.

(1) If $|V^*| = k$, return $V^*$

(2) Else, solve SNDOP-ALL($V^*$), returning set $V''$.

(a) If $V'' - V^* = \emptyset$, return $V^*$

(b) Else, pick $v \in V'' - V^*$ and return FIND-SET($Q$, $V^* \cup \{v\}$)

Note that the above algorithm can recurse at most $k$ times.
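
A minimal sketch of FIND-SET on a toy instance follows. It assumes a brute-force stand-in for the SNDOP-ALL oracle (`sndop_all`) and a made-up coverage function in place of value; both are illustrative assumptions, not part of the paper, and adding facts for $V^*$ is modelled simply by forcing those vertices into the evaluated set.

```python
from itertools import combinations

def sndop_all(vertices, value, k, fixed):
    """Brute-force stand-in for the oracle SNDOP-ALL(V*): the union of every
    optimal way to extend `fixed` with at most k - |fixed| further vertices."""
    free = [v for v in vertices if v not in fixed]
    best, union = None, set()
    for size in range(k - len(fixed) + 1):
        for extra in combinations(free, size):
            val = value(fixed | set(extra))
            if best is None or val > best:
                best, union = val, set(extra)
            elif val == best:
                union |= set(extra)
    return union

def find_set(vertices, value, k, chosen=frozenset()):
    if len(chosen) == k:                                          # step (1)
        return chosen
    candidates = sndop_all(vertices, value, k, chosen) - chosen   # step (2)
    if not candidates:                                            # step (2a)
        return chosen
    v = min(candidates)                                           # step (2b): any pick is allowed
    return find_set(vertices, value, k, chosen | {v})

# Hypothetical toy instance: value(V') is the number of items covered by V'.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {3}, "d": {4}}
coverage = lambda chosen: len(set().union(*(cover[v] for v in chosen)))
print(sorted(find_set(list(cover), coverage, 2)))
```

Each recursive call adds one vertex, so the recursion depth is bounded by $k$, and every call makes a single oracle query.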

CLAIM  1:  The  V*  returned  by  FIND-SET  is  a  valid  solution  to  the  SNDOP-query  (with 
the  same  input  for  Q). 

First, we number the elements in $V^*$ as $v_1, \ldots, v_{size}$, where $v_1$ is picked as the first element in the solution and vertex $v_i$ is added at the $i$th recursive call of FIND-SET. We know that $size \le k$.

BASE CASE: There is a set of vertices of size $\le k$ that is a solution to the SNDOP-query s.t. vertex $v_1$ is in that set. This follows directly from the definition of SNDOP-ALL.

INDUCTIVE HYPOTHESIS: For some $k' \le size$, we assume that for vertices $v_1, \ldots, v_{k'-1}$ there is some set of vertices of size $\le k$ that is a solution to the SNDOP-query s.t. vertices $v_1, \ldots, v_{k'-1}$ are in that set.

INDUCTIVE STEP: For some $k' \le size$, consider vertices $v_1, \ldots, v_{k'}$. By the inductive hypothesis, vertices $v_1, \ldots, v_{k'-1}$ are in a solution of size $\le k$. By the construction and the definition of SNDOP-ALL, we know that vertex $v_{k'}$ must be in that set as well.

CLAIM  2:  Given  some  V'  as  a  solution  to  the  SNDOP-query,  the  algorithm  FIND-SET 
can  be  run  in  such  a  way  to  return  that  set. 

Number each vertex in $V'$ as $v_1, \ldots, v_{size}$. By the definition of SNDOP-ALL, upon the $i$th call to FIND-SET, we are guaranteed that the vertices $v_i, \ldots, v_{size}$ will be in set $V''$. Simply pick vertex $v_i$ and follow the algorithm to the next recursive call; the claim immediately follows.

PROOF  OF  PROPOSITION:  Note  the  construction  can  be  accomplished  in  PTIME. 
The  proposition  follows  directly  from  claims  1-2.  □ 

A.11.  Proof  of  Theorem  3.23 

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, and a GAP $\Pi \supseteq \Pi_S$, finding $\bigcup_{V' \in ans(Q)} V'$ reduces to $|V| + 1$ SNDOP queries, where $V$ is the set of vertices of $S$.

Proof. We set up $|V|$ SNDOP-queries as follows:

— Let $k_{all}$ be the $k$ value for the SNDOP-ALL query and, for each SNDOP-query $i$, let $k_i$ be the $k$ for that query. For each query $i$, set $k_i = k_{all} - 1$.

— Number each element $v_i \in V$ such that $g_I(v_i)$ and $VC(v_i)$ are true. For the $i$th SNDOP-query, let $v_i$ be the corresponding element of $V$.

— Let $\Pi_i$ refer to the GAP associated with the $i$th SNDOP-query and $\Pi_{all}$ be the program for SNDOP-ALL. For each program $\Pi_i$, add fact $g_I(v_i) : 1$.

—  For  each  SNDOP-query  i,  the  remainder  of  the  input  is  the  same  as  for  SNDOP-ALL. 

After  the  construction,  do  the  following: 

(1) We shall refer to a SNDOP-query that has the same input as SNDOP-ALL as the “primary query.” Let $V_{ans}^{(pri)}$ be an answer to this query and $value(V_{ans}^{(pri)})$ be the associated value.

(2) For each SNDOP-query $i$, let $V_{ans}^{(i)}$ be an answer and $value(V_{ans}^{(i)})$ be the associated value.

(3) Let $V''$, the solution to SNDOP-ALL, be initialized as $\emptyset$.

(4) For each SNDOP-query $i$, if $value(V_{ans}^{(i)}) = value(V_{ans}^{(pri)})$, then add vertex $v_i$ to $V''$.
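
The construction and steps (1)-(4) can be sketched as follows. The function `best_value` is a brute-force stand-in for a single SNDOP query, the coverage function replaces value, and every vertex is assumed to satisfy $g_I$ and $VC$; all of these are assumptions made for illustration only.

```python
from itertools import combinations

def best_value(vertices, value, k, forced=frozenset()):
    """Brute-force stand-in for one SNDOP query: the best value obtainable by
    adding at most k vertices to the `forced` ones (the added facts)."""
    free = [v for v in vertices if v not in forced]
    return max(
        value(forced | set(extra))
        for size in range(k + 1)
        for extra in combinations(free, size)
    )

def sndop_all_via_queries(vertices, value, k_all):
    """Uses |V| + 1 queries; assumes k_all >= 1."""
    primary = best_value(vertices, value, k_all)             # step (1): primary query
    union = set()                                            # step (3)
    for v in vertices:                                       # step (2): query i with k_i = k_all - 1
        if best_value(vertices, value, k_all - 1, frozenset({v})) == primary:
            union.add(v)                                     # step (4)
    return union

# Hypothetical toy instance: value(V') is the number of items covered by V'.
cover = {"a": {1, 2}, "b": {3, 4}, "c": {1}}
coverage = lambda chosen: len(set().union(*(cover[v] for v in chosen)))
print(sorted(sndop_all_via_queries(list(cover), coverage, 2)))
```

In this toy instance the only optimal set of size at most 2 is {a, b}, so vertex c is rejected because forcing it into the solution lowers the best reachable value.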

CLAIM 1: If, for the $i$th SNDOP-query, $value(V_{ans}^{(i)}) = value(V_{ans}^{(pri)})$, then $v_i$ must be in the solution to SNDOP-ALL.

Suppose, by way of contradiction, that for the $i$th query, $value(V_{ans}^{(i)}) = value(V_{ans}^{(pri)})$, but $v_i$ is not in the solution to SNDOP-ALL. Then there is no $V'$ of size $\le k$ s.t. $v_i \in V'$ and $V'$ is an answer to the primary SNDOP-query. However, this is a contradiction, as $v_i$ together with the vertices returned by the $i$th query is guaranteed to be a valid answer to the primary query.

CLAIM 2: For each $v_i$ in a solution to SNDOP-ALL, the $i$th SNDOP query returns a value s.t. $value(V_{ans}^{(i)}) = value(V_{ans}^{(pri)})$.

Suppose, by way of contradiction, that there is some $v_i$ in the solution to SNDOP-ALL s.t. the $i$th query returns a value that is not equal to the value returned by the primary query. However, by the definition of SNDOP-ALL, this is not possible, hence a contradiction.
PROOF  OF  PROPOSITION:  Note  the  construction  can  be  accomplished  in  PTIME. 
The  proposition  follows  directly  from  claims  1-2.  □ 

B.  PROOFS  FOR  SECTION  5 
B.1. Proof of Proposition 5.4

Suppose $\Pi$ is any GAP. Then:

(1) $S_\Pi$ is monotonic.

(2) $S_\Pi$ has a least fixpoint $lfp(S_\Pi)$ and $lfp(T_\Pi) = grd(lfp(S_\Pi))$. That is, $lfp(S_\Pi)$ is a non-ground representation of the least fixpoint of the (ground) operator $T_\Pi$.

PROOF. Part 1 follows directly from the definition: for a given atom $A$ and interpretation $I$, $S_\Pi(I)(A) \ge I(A)$.

Part 2 follows directly from the definitions of $S_\Pi$ and $T_\Pi$. □

B.2.  Proof  of  Theorem  5.6 

Given SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$ and a GAP $\Pi$ embedding a social network $S$, if agg is monotonic then:

• There is an answer to the SNDOP query $Q$ w.r.t. $\Pi$ iff SNDOP-Mon($\Pi$, agg, VC, k, $g_I(V)$, $g_O(V)$) does not return NIL.

• If SNDOP-Mon($\Pi$, agg, VC, k, $g_I(V)$, $g_O(V)$) returns any result other than NIL, then that result is an answer to the SNDOP query $Q$ w.r.t. $\Pi$.

Proof. Part 1 ($\Leftarrow$): Suppose there is an answer to the query and SNDOP-Mon returns NIL. Then there is some set of vertices $sol$ of cardinality $\le k$ s.t. $\Pi \cup \bigcup_{v \in sol} \{g_I(v) : 1\} \models VC$. However, such a set would have been added as a tuple into Todo at step 2 or step 4(c)iB. Hence, a contradiction.

Part 1 ($\Rightarrow$): Suppose there is no answer to the query and SNDOP-Mon does not return NIL. Then there is no set of vertices $sol$ of cardinality $\le k$ s.t. $\Pi \cup \bigcup_{v \in sol} \{g_I(v) : 1\} \models VC$. SNDOP-Mon performs such a check at line 4b. Hence, a contradiction.

Part 2: Suppose, by way of contradiction, that there exists a set of vertices $sol$ that is a solution of cardinality $\le k$ s.t. $\bigcup_{v \in sol} \{g_I(v) : 1\}$ is not what is returned by SNDOP-Mon and $value(\Pi \cup \bigcup_{v \in sol} \{g_I(v) : 1\})$ is greater than bestVal. We note that SNDOP-Mon considers most sets of vertices of cardinality $\le k$. Further, the monotonicity of agg and line 4(c)i tell us that the only solutions not considered are ones guaranteed to have a value less than bestVal. Hence, a contradiction. □

B.3. Proof of Proposition 5.7

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, and a GAP $\Pi \supseteq \Pi_S$, the complexity of GREEDY-SNDOP is $O(k \cdot |V| \cdot F(|V|))$, where $F(|V|)$ is the time complexity to compute $value(V')$ for some set $V' \subseteq V$ of size $k$.

PROOF. The outer loop at line 2 iterates $k$ times, the inner loop at line 2b iterates $O(|V|)$ times, and in each inner iteration, at line 2(b)i, the function value is computed at cost $F(|V|)$. The statement follows. □
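
The loop structure behind this bound can be sketched as follows. This is a generic greedy selection written for illustration, not a transcription of GREEDY-SNDOP: the coverage function stands in for value, the counter `evaluations` is added only to make the bound visible, and the early exit when no vertex improves the value is an assumption of the sketch.

```python
def greedy_sndop(vertices, value, k):
    """Greedy selection that also counts the value computations of the inner loop."""
    chosen, evaluations = set(), 0
    for _ in range(k):                          # outer loop: k iterations
        base = value(chosen)
        best_vertex, best_gain = None, 0
        for v in vertices:                      # inner loop: at most |V| iterations
            if v in chosen:
                continue
            evaluations += 1                    # one computation of value, cost F(|V|)
            gain = value(chosen | {v}) - base
            if gain > best_gain:
                best_vertex, best_gain = v, gain
        if best_vertex is None:                 # no vertex improves the value
            break
        chosen.add(best_vertex)
    return chosen, evaluations

# Hypothetical toy instance: value(V') is the number of items covered by V'.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
coverage = lambda chosen: len(set().union(*(cover[v] for v in chosen)))
picked, calls = greedy_sndop(list(cover), coverage, 2)
print(sorted(picked), calls, calls <= 2 * len(cover))
```

The inner-loop counter never exceeds $k \cdot |V|$, and each counted step costs one evaluation of value, which gives the stated product.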

B.4.  Proof  of  Theorem  5.8 

Given a SNDOP query $Q = (agg, VC, k, g_I(V), g_O(V))$, a social network $S$, and a GAP $\Pi \supseteq \Pi_S$, if

— $\Pi$ is a linear GAP

— $Q$ is a-priori VC

— agg is positive-linear

— value is zero-starting

then GREEDY-SNDOP is an $\left(\frac{e}{e-1}\right)$-approximation algorithm.

PROOF. [Nemhauser et al. 1978] proposes a greedy algorithm to solve the general problem of finding an element of a uniform matroid that maximizes a non-decreasing, submodular function $F$ (defined over the elements of the matroid) s.t. $F(\emptyset) = 0$. The algorithm is very simple, yet guarantees an $\frac{e}{e-1}$ approximation: it incrementally builds a solution (without backtracking) starting with the empty set; in each iteration it adds an element that most improves the current solution (according to $F$).

Answering a SNDOP query is the problem of finding a pre-answer with maximum value. We show that, under the assumptions stated in the claim, the set of pre-answers is a uniform matroid and value satisfies the restrictions stated above for $F$ (which enables us to use the greedy algorithm of [Nemhauser et al. 1978]). The hypothesis that $Q$ is a-priori VC entails that the set of pre-answers is a uniform matroid by Lemma 3.14. The hypothesis that agg is positive-linear entails that agg is monotonic (see Proposition 3.9); the latter in turn entails that value is monotonic (see Lemma 3.13). The first, second, and third hypotheses in the claim entail that value is submodular, by Theorem 3.15. Recall that the fourth hypothesis means $value(\emptyset) = 0$ by definition.

Hence, under the conditions stated in the claim, answering a SNDOP query is an instance of the problem addressed in [Nemhauser et al. 1978]. Since GREEDY-SNDOP is a specialization of the algorithm in [Nemhauser et al. 1978] applied to our setting, it follows that GREEDY-SNDOP is an $\left(\frac{e}{e-1}\right)$-approximation algorithm. □
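
The guarantee can be checked numerically on a small instance. The sketch below assumes a coverage function as a stand-in for value (it is non-decreasing, submodular, and zero on the empty set) and compares a generic greedy selection with a brute-force optimum; the instance and the function names are hypothetical.

```python
from itertools import combinations
from math import e

def greedy(vertices, value, k):
    chosen = set()
    for _ in range(k):
        remaining = [v for v in vertices if v not in chosen]
        if not remaining:
            break
        # Add the element that most improves the current solution.
        chosen.add(max(remaining, key=lambda v: value(chosen | {v})))
    return chosen

def optimum(vertices, value, k):
    """Best value over all sets of at most k vertices, by exhaustive search."""
    return max(
        value(set(s))
        for size in range(k + 1)
        for s in combinations(vertices, size)
    )

# Toy instance on which the greedy choice is not optimal.
cover = {"x": {1, 2, 3, 4}, "y": {1, 2, 5}, "z": {3, 4, 6}}
coverage = lambda chosen: len(set().union(*(cover[v] for v in chosen)))

greedy_value = coverage(greedy(list(cover), coverage, 2))
optimal_value = optimum(list(cover), coverage, 2)
print(greedy_value, optimal_value, optimal_value <= e / (e - 1) * greedy_value)
```

Here greedy first takes x and then covers one more item, while the optimum takes y and z; the gap stays within the factor $\frac{e}{e-1}$.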